Abstract

We study theoretical properties of regularized robust M-estimators, applicable when 
data are drawn from a sparse high-dimensional linear model and contaminated by heavy-tailed
distributions and/or outliers in the additive errors and covariates. We first establish
a form of local statistical consistency for the penalized regression estimators under fairly 
mild conditions on the error distribution: When the derivative of the loss function is 
bounded and satisfies a local restricted curvature condition, all stationary points within 
a constant radius of the true regression vector converge at the minimax rate enjoyed by 
the Lasso with sub-Gaussian errors. When an appropriate nonconvex regularizer is used 
in place of an $\ell_1$-penalty, we show that such stationary points are in fact unique and
equal to the local oracle solution with the correct support—hence, results on asymptotic 
normality in the low-dimensional case carry over immediately to the high-dimensional 
setting. This has important implications for the efficiency of regularized nonconvex
M-estimators when the errors are heavy-tailed. Our analysis of the local curvature of the loss
function also has useful consequences for optimization when the robust regression function 
and/or regularizer is nonconvex and the objective function possesses stationary points 
outside the local region. We show that as long as a composite gradient descent algorithm 
is initialized within a constant radius of the true regression vector, successive iterates will 
converge at a linear rate to a stationary point within the local region. Furthermore, the 
global optimum of a convex regularized robust regression function may be used to obtain 
a suitable initialization. The result is a novel two-step procedure that uses a convex
M-estimator to achieve consistency and a nonconvex M-estimator to increase efficiency. We
conclude with simulation results that corroborate our theoretical findings. 


1 Introduction 

Ever since robustness entered the statistical scene in Box’s classical paper of 1953 [9], many significant
steps have been taken toward analyzing and quantifying robust statistical procedures,
notably the work of Tukey [59], Huber [28], and Hampel [22], among others. Huber’s seminal 
work on M-estimators [28] established asymptotic properties of a class of statistical estimators 
containing the maximum likelihood estimator, and provided initial theory for constructing 
regression functions that are robust to deviations from normality. Despite the substantial 
body of now existent work on robust M-estimators, however, research on high-dimensional 
regression estimators has mostly been limited to penalized likelihood-based approaches (e.g., 
[58, 16, 20, 52]). Several recent papers [46, 35, 36] have shed new light on high-dimensional
M-estimators, by presenting a fairly unified framework for analyzing statistical and optimization
properties of such estimators. However, whereas the M-estimators studied in those papers
are finite-sample versions of globally convex functions, many important M-estimators, such as
those arising in classical robust regression, only possess convex curvature over local regions— 
even at the population level. In this paper, we present new theoretical results, based only 
on local curvature assumptions, which may be used to establish statistical and optimization 
properties of regularized M-estimators with highly nonconvex loss functions. 

Broadly, we are interested in linear regression estimators that are robust to the following 
types of deviations: 

(a) Model misspecification. The ordinary least squares objective function may be viewed
as a maximum likelihood estimator for linear regression when the additive errors $\epsilon_i$
are normally distributed. It is well known that the $\ell_1$-penalized ordinary least squares
estimator is still consistent when the $\epsilon_i$’s are sub-Gaussian [8, 61]; however, if the
distribution of the $\epsilon_i$’s deviates more wildly from the normal distribution (e.g., the $\epsilon_i$’s
are heavy-tailed), the regression estimator based on the least squares loss no longer
converges at optimal rates. In addition, whereas the usual regularity assumptions on 
the design matrix such as the restricted eigenvalue condition have been shown to hold 
with high probability when the covariates are sub-Gaussian [51, 55], we wish to devise 
estimators that are also consistent under weaker assumptions on the distribution of the 
covariates. 

(b) Outliers. Even when the covariates and error terms are normally distributed, the
regression estimator may be inconsistent when observations are contaminated by outliers
in the predictors and/or response variables [54]. Whereas the standard ordinary least 
squares loss function is non-robust to outliers in the observations, alternative estimators 
exist in a low-dimensional setting that are robust to a certain degree of contamination. 
We wish to extend this theory to high-dimensional regression estimators, as well. 

Inspired by the classical theory on robust estimators for linear regression [30, 41, 23], we study 
regularized versions of low-dimensional robust regression estimators and establish statistical 
guarantees in a high-dimensional setting. As we will see, the regularized robust regression 
functions continue to enjoy good behavior in high dimensions, and we can quantify the degree 
to which the high-dimensional estimators are robust to the types of deviations described 
above. 

Our first main contribution is to provide a general set of sufficient conditions under which 
optima of regularized robust M-estimators are statistically consistent, even in the presence 
of heavy-tailed errors and outlier contamination. The conditions involve a bound on the 
derivative of the regression function, as well as restricted strong convexity of the loss function 
in a neighborhood of constant radius about the true parameter vector, and the conclusions are 
given in terms of the tails of the error distribution. The notion of restricted strong convexity, 
as used previously in the literature [46, 2, 35, 36], traditionally involves a global condition 
on the behavior of the loss function. However, due to the highly nonconvex behavior of the 
robust regression functions of interest, we assume only a local condition of restricted strong 
convexity in the development of our statistical results. Consequently, our main theorem 
provides guarantees only for stationary points within the local region of strong curvature. 
We show that all such local stationary points are statistically consistent estimators for the 
true regression vector; when the covariates are sub-Gaussian, the rate of convergence agrees 
(up to a constant factor) with the rate of convergence for $\ell_1$-penalized ordinary least squares
regression with sub-Gaussian errors. We also use the same framework to study generalized
M-estimators and provide results for statistical consistency of local stationary points under
weaker distributional assumptions on the covariates. 

The wide applicability of our theorem on statistical consistency of high-dimensional robust
M-estimators opens the door to an important question regarding the design of robust
regression estimators, which is the topic of our second contribution: In the setting of heavy-tailed
errors, if all regression estimators with bounded derivative are statistically consistent
with rates agreeing up to a constant factor, what are the advantages of using a complicated
nonconvex regression function over a simple convex function such as the Huber loss? In the
low-dimensional setting, several independent lines of work provide reasons for using nonconvex
M-estimators over their convex alternatives [30, 56]. One compelling justification is from
the viewpoint of statistical efficiency. Indeed, the log likelihood function of the heavy-tailed 
t-distribution with one degree of freedom gives rise to the nonconvex Cauchy loss, which is 
consequently asymptotically efficient [32]. In our second main theorem, we prove that by 
using a suitable nonconvex regularizer [16, 69], we may guarantee that local stationary points 
of the regularized robust M-estimator agree with a local oracle solution defined on the correct 
support. Thus, provided the sample size scales sufficiently quickly with the level of sparsity, 
results on asymptotic normality of low-dimensional M-estimators with a diverging number of 
parameters [29, 67, 50, 39, 25] may be used to establish asymptotic normality of the corresponding
high-dimensional estimators, as well. In particular, when the loss function equals
the negative log likelihood of the error distribution, stationary points of the high-dimensional
M-estimator will also be efficient in an asymptotic sense. Our oracle result and subsequent
conclusions regarding asymptotic normality resemble a variety of other results in the literature
on nonconvex regularization [17, 10, 33], but our result is stronger because it provides
guarantees for all stationary points in the local region. Our proof technique leverages the 
primal-dual witness construction recently proposed in Loh and Wainwright [36]; however, we 
require a more refined analysis here in order to extend the result to one involving only local 
properties of the loss function. 

Our third and final contribution addresses algorithms used to optimize our proposed
M-estimators. Since our statistical consistency and oracle results only provide guarantees for the
behavior of local solutions, we need to devise an optimization algorithm that always converges
to a stationary point inside the local region. Indeed, local optima that are statistically inconsistent
are the bane of nonconvex M-estimators, even in low-dimensional settings [19]. To
remedy this issue, we propose a novel two-step algorithm that is guaranteed to converge to a 
stationary point within the local region of restricted strong convexity. Our algorithm consists 
of optimizing two separate regularized M-estimators in succession, and may be applied to 
situations where both the loss and regularizer are nonconvex. In the first step, we optimize a 
convex regularized M-estimator to obtain a sufficiently close point that is then used to initialize
an optimization algorithm for the original (nonconvex) M-estimator in the second step.
We use the composite gradient descent algorithm [ ] in both steps of the algorithm, and prove
rigorously that if the initial point in the second step lies within the local region of restricted
curvature, all successive iterates will continue to lie in the region and converge at a linear rate
to an appropriate stationary point. Any convex, statistically consistent M-estimator suffices
for the first step; we use the $\ell_1$-penalized Huber loss in our simulations involving sub-Gaussian
covariates with heavy-tailed errors, since global optima are statistically consistent by our earlier
theory. Our resulting two-step estimator, which first optimizes a convex Huber loss to
obtain a consistent estimator and then optimizes a (possibly nonconvex) robust M-estimator
to obtain a more efficient estimator, is reminiscent of the one-step estimators common in
the robust regression literature [7]; however, here we require full runs of composite gradient
descent in each step of the algorithm, rather than a single Newton-Raphson step. Note that
if the goal is to optimize an M-estimator involving a convex loss and nonconvex regularizer,
such as the SCAD-penalized Huber loss, our two-step algorithm is also applicable, where we
optimize the $\ell_1$-penalized loss in the first step.


Related work: We close this section by highlighting three recent papers on related topics. 
The analysis in this paper most closely resembles the work of Lozano and Meinshausen [37], 
in that we study stationary points of nonconvex functions used for robust high-dimensional 
linear regression within a local neighborhood of the true regression vector. Although the 
technical tools we use here are similar, we focus on regression functions that are expressible as
M-estimators; the minimum distance loss function proposed in that paper does not fall into this
category. In addition, we formalize the notion of basins of attraction for optima of nonconvex
M-estimators and develop a two-step optimization algorithm that consists of optimizing
successive regularized M-estimators, which goes beyond their results about local convergence 
of a composite gradient descent algorithm. 

Another related work is that of Fan et al. [15]. While that paper focuses exclusively on 
developing estimation bounds for penalized robust regression with the Huber loss function, 
the results presented in our paper are strictly more general, since they hold for nonconvex 
M-estimators, as well. The analysis of the $\ell_1$-penalized Huber loss is still relevant to our
analysis, however, because as shown below, its global convergence guarantees provide us with 
a good initialization point for the composite gradient algorithm that we will apply in the first 
step of our two-step algorithm. 

Finally, we draw attention to the recent work by Mendelson [44]. In that paper, careful 
derivations based on empirical process theory demonstrate the advantage of using differently 
parametrized convex loss functions tuned according to distributional properties of the additive 
noise in the model. Our analysis also reveals the impact of different parameter choices for 
the regression function on the resulting estimator, but the rates of Mendelson [44] are much 
sharper than ours (albeit agreeing up to a constant factor). However, our analysis is not limited
to convex loss functions, and covers nonconvex loss functions possessing local curvature,
as well. Finally, note that while Mendelson [44] is primarily concerned with optimizing the
regression estimator with respect to $\ell_1$- and $\ell_2$-error, our oracle results suggest that it may be
instructive to consider second-order properties as well. Indeed, taking into account attributes
such as the variance and asymptotic efficiency of the estimator may lead to a different parameter
choice for a robust loss function than if the primary goal is to minimize the bias alone.

The remainder of our paper is organized as follows: In Section 2, we provide the basic 
background concerning M- and generalized M-estimators, and introduce various robust loss 
functions and regularizers to be discussed in the sequel. In Section 3, we present our main theorem
concerning statistical consistency of robust high-dimensional M-estimators and unpack
the distributional conditions required for the assumptions of the theorem to hold for specific
robust estimators through a series of propositions. We also present our main theorem concerning
oracle properties of nonconvex regularized M-estimators, with a corollary illustrating the
types of asymptotic normality conclusions that may be derived from the oracle result. Section 4
provides our two-step optimization algorithm and corresponding theoretical guarantees.
We conclude in Section 5 with a variety of simulation results. A brief review of robustness 
measures is provided in Appendix A, and proofs of the main theorems and all supporting 
lemmas and propositions are contained in the remaining supplementary appendices.

Notation: For functions $f(n)$ and $g(n)$, we write $f(n) \lesssim g(n)$ to mean that $f(n) \le c g(n)$ for
some universal constant $c \in (0, \infty)$, and similarly, $f(n) \gtrsim g(n)$ when $f(n) \ge c' g(n)$ for some
universal constant $c' \in (0, \infty)$. We write $f(n) \asymp g(n)$ when $f(n) \lesssim g(n)$ and $f(n) \gtrsim g(n)$ hold
simultaneously. For a vector $v \in \mathbb{R}^p$ and a subset $S \subseteq \{1, \dots, p\}$, we write $v_S \in \mathbb{R}^S$ to denote
the vector $v$ restricted to $S$. For a matrix $M$, we write $|||M|||_2$ to denote the spectral norm.
For a function $h : \mathbb{R}^p \to \mathbb{R}$, we write $\nabla h$ to denote a gradient or subgradient of the function.

2 Background and problem setup 

In this section, we provide some background on M- and generalized M-estimators for robust
regression. We also describe the classes of nonconvex regularizers that will be covered by our
theory.

Throughout, we will assume that we have $n$ i.i.d. observations $\{(x_i, y_i)\}_{i=1}^n$ from the linear
model

$$y_i = x_i^T \beta^* + \epsilon_i, \qquad \forall\, 1 \le i \le n, \tag{1}$$

where $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, and $\beta^* \in \mathbb{R}^p$ is a $k$-sparse vector. We also assume that $x_i \perp\!\!\!\perp \epsilon_i$
and both are zero-mean random variables. We are interested in high-dimensional regression
estimators of the form

$$\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \left\{ \mathcal{L}_n(\beta) + \rho_\lambda(\beta) \right\}, \tag{2}$$

where $\mathcal{L}_n$ is the empirical loss function and $\rho_\lambda$ is a penalty function. For instance, the Lasso
program is given by the loss $\mathcal{L}_n(\beta) = \frac{1}{n} \sum_{i=1}^n (x_i^T \beta - y_i)^2$ and penalty $\rho_\lambda(\beta) = \lambda \|\beta\|_1$, but
this framework allows for much more general settings. Since we are interested in cases where
the loss and regularizer may be nonconvex, we include the side condition $\|\beta\|_1 \le R$ in the
program (2) in order to guarantee the existence of local/global optima. We will require
$R \ge \|\beta^*\|_1$, so that the true regression vector $\beta^*$ is feasible for the program.

In the scenarios below, we will consider loss functions $\mathcal{L}_n$ that satisfy

$$\mathbb{E}\left[\nabla \mathcal{L}_n(\beta^*)\right] = 0. \tag{3}$$

When the population-level loss $\mathcal{L}(\beta) := \mathbb{E}[\mathcal{L}_n(\beta)]$ is a convex function, equation (3) implies
that $\beta^*$ is a global optimum of $\mathcal{L}(\beta)$. When $\mathcal{L}$ is nonconvex, the condition (3) ensures that $\beta^*$
is at least a stationary point of the function. Our goal is to develop conditions under which
certain stationary points of the program (2) are statistically consistent estimators for $\beta^*$.

2.1 Robust M-estimators 

We wish to study loss functions $\mathcal{L}_n$ that are robust to outliers and/or model misspecification.
Consequently, we borrow our loss functions from the classical theory of robust regression in
low dimensions; the additional regularizer $\rho_\lambda$ appearing in the program (2) encourages sparsity
in the solution and endows it with appealing behavior in high dimensions. Here, we provide a
brief review of M-estimators used for robust linear regression. For a more detailed treatment 
of the basic concepts of robust regression, see the books [30, 41, 23] and the many references 
cited therein. 

Let $\ell$ denote the regression function defined on an individual observation pair $(x_i, y_i)$. The
corresponding M-estimator is then given by

$$\mathcal{L}_n(\beta) = \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i). \tag{4}$$

Note that

$$\mathbb{E}\left[\nabla \mathcal{L}_n(\beta^*)\right] = \mathbb{E}\left[\ell'(x_i^T \beta^* - y_i)\, x_i\right] = \mathbb{E}\left[\ell'(\epsilon_i)\, x_i\right] = \mathbb{E}\left[\ell'(\epsilon_i)\right] \cdot \mathbb{E}[x_i] = 0,$$

so the condition (3) is always satisfied. In particular, the maximum likelihood estimator
corresponds to the choice $\ell(u) = -\log p_\epsilon(u)$, where $p_\epsilon$ is the probability density function of the
additive errors $\epsilon_i$. Note that when $\epsilon_i \sim N(0, 1)$, the MLE corresponds to the choice $\ell(u) = \frac{u^2}{2}$,
and the resulting loss function is convex.

Some of the loss functions that we will analyze in this paper include the following: 


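Huber loss: We have

$$\ell(u) = \begin{cases} \dfrac{u^2}{2}, & \text{if } |u| \le \xi, \\[2mm] \xi |u| - \dfrac{\xi^2}{2}, & \text{if } |u| > \xi. \end{cases}$$

In this case, $\ell$ is a convex function. Although $\ell''$ does not exist everywhere, $\ell'$ is continuous
and $\|\ell'\|_\infty \le \xi$.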
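As a quick numerical companion to the Huber loss, the following minimal sketch evaluates the loss and its derivative and checks the bound on the derivative over a grid. It assumes the standard normalization (quadratic part $u^2/2$, tuning parameter written `xi`); the function names and the value `xi = 1.5` are illustrative choices, not taken from the paper.

```python
def huber_loss(u: float, xi: float) -> float:
    """Huber loss: quadratic for |u| <= xi, linear with slope xi beyond."""
    if abs(u) <= xi:
        return 0.5 * u * u
    return xi * abs(u) - 0.5 * xi * xi


def huber_grad(u: float, xi: float) -> float:
    """Derivative of the Huber loss: the argument clipped to [-xi, xi]."""
    return max(-xi, min(xi, u))


xi = 1.5  # illustrative tuning parameter (an assumption, not from the text)
grid = [i / 100.0 for i in range(-1000, 1001)]  # u ranging over [-10, 10]

# Largest absolute derivative seen on the grid; it should not exceed xi.
sup_grad = max(abs(huber_grad(u, xi)) for u in grid)
print(sup_grad)
```

The clipping form of `huber_grad` makes the boundedness of the derivative immediate, which is the property the later robustness discussion relies on.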


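Tukey’s biweight: We have

$$\ell(u) = \begin{cases} \dfrac{\xi^2}{6} \left(1 - \left(1 - \dfrac{u^2}{\xi^2}\right)^3\right), & \text{if } |u| \le \xi, \\[2mm] \dfrac{\xi^2}{6}, & \text{if } |u| > \xi. \end{cases}$$

Note that $\ell$ is nonconvex. We also compute the first derivative

$$\ell'(u) = \begin{cases} u \left(1 - \dfrac{u^2}{\xi^2}\right)^2, & \text{if } |u| \le \xi, \\[2mm] 0, & \text{if } |u| > \xi, \end{cases}$$

and second derivative

$$\ell''(u) = \begin{cases} \left(1 - \dfrac{u^2}{\xi^2}\right) \left(1 - \dfrac{5u^2}{\xi^2}\right), & \text{if } |u| \le \xi, \\[2mm] 0, & \text{if } |u| > \xi. \end{cases}$$

Note that $\ell''$ is continuous. Furthermore, $\|\ell'\|_\infty \le \frac{16\xi}{25\sqrt{5}}$. One may check that Tukey’s biweight
function is not an MLE. Furthermore, although $\ell''$ exists everywhere and is continuous, $\ell'''$
does not exist for $u \in \left\{\pm\xi, \pm\frac{\xi}{\sqrt{5}}\right\}$.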
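The derivative bound for Tukey’s biweight can be checked numerically. The sketch below assumes the normalization in which $\ell'(u) = u(1 - u^2/\xi^2)^2$ on $|u| \le \xi$, so that the maximum of $|\ell'|$ is attained at $u^2 = \xi^2/5$; the function names, the grid, and the value `xi = 2.0` are illustrative assumptions.

```python
import math


def tukey_loss(u: float, xi: float) -> float:
    """Tukey's biweight loss; constant (xi^2 / 6) outside [-xi, xi]."""
    if abs(u) > xi:
        return xi * xi / 6.0
    t = (u / xi) ** 2
    return xi * xi / 6.0 * (1.0 - (1.0 - t) ** 3)


def tukey_grad(u: float, xi: float) -> float:
    """First derivative; vanishes identically for |u| > xi."""
    if abs(u) > xi:
        return 0.0
    return u * (1.0 - (u / xi) ** 2) ** 2


xi = 2.0  # illustrative tuning parameter (an assumption)
grid = [xi * i / 10000.0 for i in range(-15000, 15001)]  # [-1.5 xi, 1.5 xi]

sup_grad = max(abs(tukey_grad(u, xi)) for u in grid)
bound = 16.0 * xi / (25.0 * math.sqrt(5.0))
print(sup_grad, bound)  # the grid maximum should sit just below the bound
```

Because the derivative is exactly zero beyond `xi`, residuals larger than the tuning parameter contribute nothing to the gradient, which is the finite rejection behavior discussed later in this section.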


Cauchy loss: We have

$$\ell(u) = \frac{\xi^2}{2} \log\left(1 + \frac{u^2}{\xi^2}\right).$$

Note that $\ell$ is nonconvex. When $\xi = 1$, the function $\ell(u)$ is proportional to the MLE for the
t-distribution with one degree of freedom (a heavy-tailed distribution). This suggests that
for heavy-tailed distributions, nonconvex loss functions may be more desirable from the point
of view of statistical efficiency, although optimization becomes more difficult; we will explore
this idea more fully in Section 3.3 below. For the Cauchy loss, we have

$$\ell'(u) = \frac{u}{1 + u^2/\xi^2}, \qquad \text{and} \qquad \ell''(u) = \frac{1 - u^2/\xi^2}{(1 + u^2/\xi^2)^2}.$$

In particular, $|\ell'(u)|$ is maximized when $u^2 = \xi^2$, so $\|\ell'\|_\infty \le \frac{\xi}{2}$. We may also check that
$\|\ell''\|_\infty \le 1$ and $\|\ell'''\|_\infty \le \frac{3}{2\xi}$.
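The Cauchy loss and its first two derivatives are simple enough to verify by finite differences. The following sketch assumes the form $\ell(u) = \frac{\xi^2}{2}\log(1 + u^2/\xi^2)$; the function names, grid, and `xi = 2.0` are illustrative assumptions.

```python
import math


def cauchy_loss(u: float, xi: float) -> float:
    """Cauchy loss: grows only logarithmically in |u|."""
    return 0.5 * xi * xi * math.log(1.0 + (u / xi) ** 2)


def cauchy_grad(u: float, xi: float) -> float:
    """First derivative; tends to 0 as |u| grows but never equals 0 for u != 0."""
    return u / (1.0 + (u / xi) ** 2)


def cauchy_hess(u: float, xi: float) -> float:
    """Second derivative; negative once u^2 > xi^2, so the loss is nonconvex."""
    t = (u / xi) ** 2
    return (1.0 - t) / (1.0 + t) ** 2


xi = 2.0  # illustrative tuning parameter (an assumption)
grid = [xi * i / 1000.0 for i in range(-20000, 20001)]  # [-20 xi, 20 xi]

sup_grad = max(abs(cauchy_grad(u, xi)) for u in grid)
sup_hess = max(abs(cauchy_hess(u, xi)) for u in grid)
print(sup_grad, sup_hess)

# Central finite difference of the loss, compared with the analytic derivative.
h, u0 = 1e-6, 0.7
fd_grad = (cauchy_loss(u0 + h, xi) - cauchy_loss(u0 - h, xi)) / (2.0 * h)
```

On this grid the largest derivative is attained at `u = xi`, matching the statement that $|\ell'(u)|$ is maximized when $u^2 = \xi^2$.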

Although second and third derivatives do not always exist for the loss functions above,
a unifying property is that the derivative $\ell'$ is bounded in each case. This turns out to be
an important property for robustness of the resulting estimator. Intuitively, we may view a
solution $\hat{\beta}$ of the program (2) as an approximate sparse solution to the estimating equation
$\nabla \mathcal{L}_n(\beta) = 0$, or equivalently,

$$\frac{1}{n} \sum_{i=1}^n \ell'(x_i^T \beta - y_i)\, x_i = 0. \tag{5}$$

When $\beta = \beta^*$, equation (5) becomes

$$\frac{1}{n} \sum_{i=1}^n \ell'(\epsilon_i)\, x_i = 0. \tag{6}$$

In particular, if a pair $(x_i, y_i)$ satisfies the linear model (1) but $\epsilon_i$ is an outlier, its contribution
to the sum in equation (6) is bounded when $\ell'$ is bounded, lessening the contamination effect
of gross outliers.
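To make the effect of a bounded derivative concrete, the sketch below compares the size of a single observation's contribution $\ell'(\epsilon_i)\, x_i$ to the estimating equation under the least squares loss (taking $\ell(u) = u^2/2$, so $\ell'(u) = u$) and under the Huber loss. The covariate vector, the error values, and the tuning parameter are hypothetical illustration values.

```python
import math


def ls_grad(u: float) -> float:
    """Derivative of the least squares loss u^2 / 2: unbounded in u."""
    return u


def huber_grad(u: float, xi: float = 1.5) -> float:
    """Derivative of the Huber loss: clipped to [-xi, xi]."""
    return max(-xi, min(xi, u))


def contribution_norm(eps: float, x: list, grad) -> float:
    """Euclidean norm of one summand l'(eps) * x in the estimating equation."""
    g = grad(eps)
    return math.sqrt(sum((g * xj) ** 2 for xj in x))


x = [1.0, -2.0]  # hypothetical covariate vector
for eps in (1.0, 1e3, 1e6):  # increasingly gross outliers in the error
    print(eps, contribution_norm(eps, x, ls_grad), contribution_norm(eps, x, huber_grad))
```

The least squares contribution grows linearly in the size of the outlier, while the Huber contribution is capped at `xi * ||x||_2` no matter how large the error becomes.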

In the robust regression literature, a redescending M-estimator has the additional property
that there exists $\xi_0 > 0$ such that $|\ell'(u)| = 0$, for all $|u| \ge \xi_0$. Then $\xi_0$ is known as a
finite rejection point, since outliers $(x_i, y_i)$ with $|\epsilon_i| \ge \xi_0$ will be completely eliminated from
the summand in equation (6). For instance, Tukey’s biweight function gives rise to a redescending
M-estimator.¹ Note that redescending M-estimators will always be nonconvex, so
computational efficiency will be sacrificed in exchange for finite rejection properties. For an
in-depth discussion of redescending M-estimators vis-a-vis different measures of robustness,
see the article by Shevlyakov et al. [56].

2.2 Generalized M-estimators 

Whereas the M-estimators described in Section 2.1 are robust with respect to outliers in the
additive noise terms $\epsilon_i$, they are non-robust to outliers in the covariates $x_i$. This may be
quantified using the concept of influence functions (see Appendix A). Intuitively, an outlier
in $x_i$ may cause the corresponding term in equation (6) to behave arbitrarily badly. This
motivates the use of generalized M-estimators that downweight large values of $x_i$ (also known
as leverage points). The resulting estimating equation is then defined as follows:

$$\sum_{i=1}^n \eta(x_i, x_i^T \beta - y_i)\, x_i = 0, \tag{7}$$

where $\eta : \mathbb{R}^p \times \mathbb{R} \to \mathbb{R}$ is defined appropriately. As will be discussed in the sequel, generalized
M-estimators may allow us to relax the distributional assumptions on the covariates; e.g.,
from sub-Gaussian to sub-exponential.

We will focus on functions $\eta$ that take the form

$$\eta(x_i, r_i) = w(x_i)\, \ell'(r_i \cdot v(x_i)), \tag{8}$$

where $w, v > 0$ are weighting functions. Note that the M-estimators considered in Section 2.1
may also be written in this form, where $w \equiv v \equiv 1$.

¹ The Cauchy loss has the property that $\lim_{u \to \infty} |\ell'(u)| = 0$, but it is not redescending for any finite $\xi_0$.

Some popular choices of $\eta$ of the form presented in equation (8) include the following:


Mallows estimator [38]: We take $v(x) \equiv 1$ and $w(x)$ of the form

$$w(x) = \min\left\{1, \frac{b}{\|Bx\|_2}\right\}, \qquad \text{or} \qquad w(x) = \min\left\{1, \frac{b^2}{\|Bx\|_2^2}\right\}, \tag{9}$$

for parameters $b > 0$ and $B \in \mathbb{R}^{p \times p}$. Note that $\|w(x) x\|_2$ is indeed bounded for a fixed choice
of $b$ and $B$, since

$$\|w(x) x\|_2 \le \frac{b \|x\|_2}{\|Bx\|_2} \le b\, |||B^{-1}|||_2 := b_0.$$

The Mallows estimator effectively shrinks data points for which $\|x_i\|_2$ is large toward an
elliptical shell defined by $B$, and likewise pushes small data points closer to the shell.


Hill-Ryan estimator [26]: We take $v(x) = w(x)$, where $w$ is defined such that $\|w(x) x\|_2$ is
bounded (e.g., equation (9)). In addition to downweighting the influence function similarly to
the Mallows estimator, the Hill-Ryan estimator scales the residuals according to the leverage
weight of the $x_i$’s.


Schweppe estimator [45]: For a parameter $B \in \mathbb{R}^{p \times p}$, we take $w(x) = \frac{1}{\|Bx\|_2}$ and
$v(x) = \frac{1}{w(x)}$. Like the Mallows estimator, Schweppe’s estimator downweights the
contribution of data points with high leverage as a summand in the estimating equation (7). If
$\ell'$ is locally linear around the origin and flattens out for larger values, Schweppe’s estimator
additionally dampens the effect of a residual $r_i$ only when it is large compared to the leverage
of $x_i$. As discussed in Hampel et al. [23], Schweppe’s estimator is designed to be optimal in
terms of a measure of variance robustness, subject to a bound on the influence function.
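The downweighting of leverage points can be illustrated with a Mallows-type weight. The sketch below assumes the weight $w(x) = \min\{1, b/\|Bx\|_2\}$ and, to keep the spectral norm of $B^{-1}$ trivial to compute, a diagonal matrix $B$; the specific values of `B`, `b`, and the simulated covariates are hypothetical.

```python
import math
import random


def matvec(B: list, x: list) -> list:
    """Plain matrix-vector product for small dense matrices."""
    return [sum(bij * xj for bij, xj in zip(row, x)) for row in B]


def norm2(v: list) -> float:
    """Euclidean norm."""
    return math.sqrt(sum(t * t for t in v))


def mallows_weight(x: list, B: list, b: float) -> float:
    """Assumed Mallows-type weight min{1, b / ||Bx||_2}."""
    nb = norm2(matvec(B, x))
    return 1.0 if nb == 0.0 else min(1.0, b / nb)


B = [[2.0, 0.0], [0.0, 0.5]]  # hypothetical diagonal scaling matrix
b = 1.5                       # hypothetical threshold
b0 = b / 0.5                  # b * |||B^{-1}|||_2, since B is diagonal

random.seed(0)
points = [[random.gauss(0.0, 50.0) for _ in range(2)] for _ in range(1000)]

# Largest norm of a reweighted covariate; it should never exceed b0.
max_weighted_norm = max(mallows_weight(x, B, b) * norm2(x) for x in points)
print(max_weighted_norm, b0)
```

Even though the raw covariates here have a large spread, every reweighted vector `w(x) * x` stays inside the ball of radius `b0`, which is what keeps each summand of the generalized estimating equation bounded when the derivative of the loss is bounded.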

Note that when $\eta$ takes the form in equation (8), the estimating equation (7) may again
be seen as a zero-gradient condition $\nabla \mathcal{L}_n(\beta) = 0$, where

$$\mathcal{L}_n(\beta) := \frac{1}{n} \sum_{i=1}^n \frac{w(x_i)}{v(x_i)}\, \ell\left((x_i^T \beta - y_i)\, v(x_i)\right). \tag{10}$$

Under reasonable conditions, such as oddness of $\ell'$ and symmetry of the error distribution,
the condition (3) may be seen to hold (cf. condition 2 of Proposition 1 below and the following
remark). The overall program for a generalized M-estimator then takes the form

$$\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \left\{ \frac{1}{n} \sum_{i=1}^n \frac{w(x_i)}{v(x_i)}\, \ell\left((x_i^T \beta - y_i)\, v(x_i)\right) + \rho_\lambda(\beta) \right\}.$$









2.3 Nonconvex regularizers 

Finally, we provide some background on the types of regularizers we will use in our analysis
of the composite objective function (2). Following the theoretical development of Loh and
Wainwright [35, 36], we require the regularizer $\rho_\lambda$ to satisfy the following properties:

Assumption 1 (Amenable regularizers). The regularizer is coordinate-separable:

$$\rho_\lambda(\beta) = \sum_{j=1}^p \rho_\lambda(\beta_j)$$

for some scalar function $\rho_\lambda : \mathbb{R} \to \mathbb{R}$. In addition:

(i) The function $t \mapsto \rho_\lambda(t)$ is symmetric around zero and $\rho_\lambda(0) = 0$.

(ii) The function $t \mapsto \rho_\lambda(t)$ is nondecreasing on $\mathbb{R}_+$.

(iii) The function $t \mapsto \frac{\rho_\lambda(t)}{t}$ is nonincreasing on $\mathbb{R}_+$.

(iv) The function $t \mapsto \rho_\lambda(t)$ is differentiable for $t \neq 0$.

(v) $\lim_{t \to 0^+} \rho_\lambda'(t) = \lambda$.

(vi) There exists $\mu > 0$ such that the function $t \mapsto \rho_\lambda(t) + \frac{\mu}{2} t^2$ is convex.

(vii) There exists $\gamma \in (0, \infty)$ such that $\rho_\lambda'(t) = 0$ for all $t \ge \gamma\lambda$.

If $\rho_\lambda$ satisfies conditions (i)-(vi) of Assumption 1, we say that $\rho_\lambda$ is $\mu$-amenable. If $\rho_\lambda$
also satisfies condition (vii), we say that $\rho_\lambda$ is $(\mu, \gamma)$-amenable [36]. In particular, if $\rho_\lambda$ is
$\mu$-amenable, then $q_\lambda(t) := \lambda|t| - \rho_\lambda(t)$ is everywhere differentiable. Defining the vector version
$q_\lambda : \mathbb{R}^p \to \mathbb{R}$ accordingly, it is easy to see that $\frac{\mu}{2} \|\beta\|_2^2 - q_\lambda(\beta)$ is convex.

Some examples of amenable regularizers are the following:


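Smoothly clipped absolute deviation (SCAD) penalty: This penalty, due to Fan and
Li [16], takes the form

$$\rho_\lambda(t) := \begin{cases} \lambda |t|, & \text{for } |t| \le \lambda, \\[1mm] -\dfrac{t^2 - 2a\lambda|t| + \lambda^2}{2(a - 1)}, & \text{for } \lambda < |t| \le a\lambda, \\[2mm] \dfrac{(a + 1)\lambda^2}{2}, & \text{for } |t| > a\lambda, \end{cases} \tag{11}$$

where $a > 2$ is fixed. The SCAD penalty is $(\mu, \gamma)$-amenable, with $\mu = \frac{1}{a - 1}$ and $\gamma = a$.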


Minimax concave penalty (MCP): This penalty, due to Zhang [68], takes the form

$$\rho_\lambda(t) := \lambda \int_0^{|t|} \left(1 - \frac{z}{\lambda b}\right)_+ dz, \tag{12}$$

where $b > 0$ is fixed. The MCP regularizer is $(\mu, \gamma)$-amenable, with $\mu = \frac{1}{b}$ and $\gamma = b$.


Standard $\ell_1$-penalty: The $\ell_1$-penalty $\rho_\lambda(t) = \lambda|t|$ is an example of a regularizer that is
0-amenable, but not $(0, \gamma)$-amenable, for any $\gamma < \infty$.
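The amenability conditions for the SCAD and MCP penalties can be probed numerically. The sketch below uses the standard closed forms of the two penalties (for MCP, the integral evaluates to $\lambda|t| - t^2/(2b)$ on $|t| \le b\lambda$ and to $b\lambda^2/2$ beyond) and checks convexity of $\rho_\lambda(t) + \frac{\mu}{2}t^2$ via second differences on a grid. The helper `is_convex_on_grid` and the parameter values are illustrative assumptions, and a grid check is only a sanity test, not a proof.

```python
def scad(t: float, lam: float, a: float) -> float:
    """SCAD penalty: linear, then quadratic, then constant."""
    s = abs(t)
    if s <= lam:
        return lam * s
    if s <= a * lam:
        return -(s * s - 2.0 * a * lam * s + lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0


def mcp(t: float, lam: float, b: float) -> float:
    """MCP penalty in closed form; constant once |t| exceeds b * lam."""
    s = abs(t)
    if s <= b * lam:
        return lam * s - s * s / (2.0 * b)
    return b * lam * lam / 2.0


def is_convex_on_grid(f, lo: float, hi: float, h: float = 0.01, tol: float = 1e-9) -> bool:
    """Check that all second differences of f on a uniform grid are >= -tol."""
    m = int(round((hi - lo) / h))
    pts = [lo + i * h for i in range(m + 1)]
    return all(f(pts[i - 1]) - 2.0 * f(pts[i]) + f(pts[i + 1]) >= -tol
               for i in range(1, m))


lam, a, b = 1.0, 3.7, 2.5  # hypothetical parameter values
mu_scad, mu_mcp = 1.0 / (a - 1.0), 1.0 / b

scad_ok = is_convex_on_grid(lambda t: scad(t, lam, a) + 0.5 * mu_scad * t * t, -6.0, 6.0)
mcp_ok = is_convex_on_grid(lambda t: mcp(t, lam, b) + 0.5 * mu_mcp * t * t, -6.0, 6.0)
scad_alone = is_convex_on_grid(lambda t: scad(t, lam, a), -6.0, 6.0)
print(scad_ok, mcp_ok, scad_alone)
```

Both penalties are flat beyond $\gamma\lambda$ (with $\gamma = a$ for SCAD and $\gamma = b$ for MCP), so large coefficients incur no additional shrinkage, in contrast to the $\ell_1$-penalty.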

As studied in detail in Loh and Wainwright [36] and leveraged in the results of Section 3.3
below, using $(\mu, \gamma)$-amenable regularizers allows us to derive a powerful oracle result concerning
local stationary points, which will be useful for our discussion of asymptotic normality.

3 Main statistical results 

We now present our core statistical results concerning stationary points of the high-dimensional
robust M-estimators described in Section 2. We begin with a general deterministic result that
ensures statistical consistency of stationary points of the program (2) when the loss function
satisfies restricted strong convexity and the regularizer is $\mu$-amenable. Next, we interpret the
consequences of our theorem for specific M-estimators and generalized M-estimators through
a series of propositions, and provide conditions on the distributions of the covariates and error
terms in order for the assumptions of the theorem to hold with high probability. Lastly, we
provide a theorem establishing that stationary points are equal to a local oracle estimator
when the regularizer is nonconvex and $(\mu, \gamma)$-amenable.

Recall that $\tilde{\beta}$ is a stationary point of the program (2) if

$$\langle \nabla \mathcal{L}_n(\tilde{\beta}) + \nabla \rho_\lambda(\tilde{\beta}),\; \beta - \tilde{\beta} \rangle \ge 0,$$

for all feasible $\beta$, where with a slight abuse of notation, we write $\nabla \rho_\lambda(\tilde{\beta}) = \lambda\, \mathrm{sign}(\tilde{\beta}) - \nabla q_\lambda(\tilde{\beta})$
(recall that $q_\lambda$ is differentiable by our assumptions). In particular, the set of stationary points
includes all local and global minima, as well as interior local maxima [6, 11].

3.1 General statistical theory 

We require the loss function $\mathcal{L}_n$ to satisfy the following local RSC condition:

Assumption 2 (RSC condition). There exist $\alpha, \tau > 0$ and a radius $r > 0$ such that

$$\langle \nabla \mathcal{L}_n(\beta_1) - \nabla \mathcal{L}_n(\beta_2),\; \beta_1 - \beta_2 \rangle \ge \alpha \|\beta_1 - \beta_2\|_2^2 - \tau \frac{\log p}{n} \|\beta_1 - \beta_2\|_1^2, \tag{13}$$

for all $\beta_1, \beta_2 \in \mathbb{R}^p$ such that $\|\beta_1 - \beta^*\|_2, \|\beta_2 - \beta^*\|_2 \le r$.

Note that the condition (13) imposes no conditions on the behavior of $\mathcal{L}_n$ outside the ball
of radius $r$ centered at $\beta^*$. In this way, it differs from the RSC condition used in Loh and
Wainwright [35], where a weaker inequality is assumed to hold for vectors outside the local
region. This paper focuses on the local behavior of stationary points around $\beta^*$, since the
loss functions used for robust regression may be more wildly nonconvex away from the origin.
As discussed in more detail below, we will take $r$ to scale as a constant independent of $n$, $p$,
and $k$. The ball of radius $r$ essentially cuts out a local basin of attraction around $\beta^*$ in which
stationary points of the M-estimator are well-behaved. Furthermore, our optimization results
in Section 4 guarantee that we may efficiently locate stationary points within this
constant-radius region via a two-step M-estimator.

We have the following main result, which requires the regularizer and loss function to
satisfy the conditions of Assumptions 1 and 2, respectively. The theorem guarantees that
stationary points within the local region where the loss function satisfies restricted strong
convexity are statistically consistent.

Theorem 1. Suppose $\mathcal{L}_n$ satisfies the RSC condition (13) with $\beta_2 = \beta^*$ and $\rho_\lambda$ is $\mu$-amenable,
with $\frac{3}{4}\mu < \alpha$. Suppose $n \ge C r^2 \cdot k \log p$ and $R \ge \|\beta^*\|_1$ and

$$\lambda \ge \max\left\{ 4 \|\nabla \mathcal{L}_n(\beta^*)\|_\infty,\; 8\tau R\, \frac{\log p}{n} \right\}. \tag{14}$$

Let $\tilde{\beta}$ be a stationary point of the program (2) such that $\|\tilde{\beta} - \beta^*\|_2 \le r$. Then $\tilde{\beta}$ exists and
satisfies the bounds

$$\|\tilde{\beta} - \beta^*\|_2 \le \frac{24 \lambda \sqrt{k}}{4\alpha - 3\mu}, \qquad \text{and} \qquad \|\tilde{\beta} - \beta^*\|_1 \le \frac{96 \lambda k}{4\alpha - 3\mu}.$$

The proof of Theorem 1 is contained in Section B.1. Note that the statement of Theorem 1 is
entirely deterministic, and the distributional properties of the covariates and error terms in the 
linear model come into play in verifying that the inequality (14) and the RSC condition (13) 
hold with high probability under the prescribed sample size scaling. 
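To get a feel for how the error bounds of Theorem 1 scale, the sketch below evaluates the two bounds $24\lambda\sqrt{k}/(4\alpha - 3\mu)$ and $96\lambda k/(4\alpha - 3\mu)$ for given constants. The function name, the curvature constants, and the choice $\lambda = 2\sqrt{\log p / n}$ are hypothetical; in practice $\lambda$ must satisfy inequality (14), which depends on the data.

```python
import math


def theorem1_bounds(lam: float, k: int, alpha: float, mu: float):
    """Return the (l2, l1) error bounds, assuming (3/4) * mu < alpha."""
    if not 0.75 * mu < alpha:
        raise ValueError("the bounds require (3/4) * mu < alpha")
    denom = 4.0 * alpha - 3.0 * mu
    l2_bound = 24.0 * lam * math.sqrt(k) / denom
    l1_bound = 96.0 * lam * k / denom
    return l2_bound, l1_bound


p, k = 5000, 10          # hypothetical dimension and sparsity
alpha, mu = 1.0, 0.5     # hypothetical RSC and amenability constants

for n in (1000, 4000, 16000):
    lam = 2.0 * math.sqrt(math.log(p) / n)  # hypothetical choice of lambda
    l2_bound, l1_bound = theorem1_bounds(lam, k, alpha, mu)
    print(n, round(l2_bound, 4), round(l1_bound, 4))
```

With $\lambda$ of order $\sqrt{\log p / n}$, quadrupling the sample size halves both bounds, and the $\ell_1$ bound is always $4\sqrt{k}$ times the $\ell_2$ bound.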


Remark: Although Theorem 1 only guarantees the statistical consistency of stationary
points within the local region of radius $r$, it is essentially the strongest conclusion one can
draw based on the local RSC assumption (13) alone. The power of Theorem 1 lies in the fact
that when $r$ is chosen to be a constant and $\frac{k \log p}{n} = o(1)$, as is the case in our robust regression
settings of interest, all stationary points within the constant-radius region are actually
guaranteed to fall within a shrinking ball of radius $O\left(\sqrt{\frac{k \log p}{n}}\right)$ centered around $\beta^*$. Hence,
the stationary points in the local region are statistically consistent at the usual minimax rate
expected for $\ell_1$-penalized ordinary least squares regression with sub-Gaussian data. As we will
illustrate in more detail in the next section, if robust loss functions with bounded derivatives
are used in place of the ordinary least squares loss, the statistical consistency conclusion of
Theorem 1 still holds even when the additive errors follow a heavy-tailed distribution or are
contaminated by outliers.


3.2 Establishing sufficient conditions 

From Theorem 1 , we see that the key ingredients for statistical consistency of local stationary 
points are (i) the boundedness of $\|\nabla \mathcal{L}_n(\beta^*)\|_\infty$ in inequality (14), which ultimately dictates the
$\ell_2$-rate of convergence of $\tilde{\beta}$ to $\beta^*$ up to a factor of $\sqrt{k}$, and (ii) the local RSC condition (13) in
Assumption 2. We provide more interpretable sufficient conditions in this section via a series 
of propositions. 

For the results of this section, we will require some boundedness conditions on the derivatives
of the loss function $\ell$, which we state in the following assumption:

Assumption 3. Suppose there exist $\kappa_1, \kappa_2 \ge 0$ such that

$$|\ell'(u)| \le \kappa_1, \quad \forall u, \tag{15}$$

$$\ell''(u) \ge -\kappa_2, \quad \forall u. \tag{16}$$

Note that the bounded derivative assumption (15) holds for all the robust loss functions
highlighted in Section 2 (but not for the ordinary least squares loss), and $\kappa_1 \asymp \xi$ in each
of those cases. Furthermore, inequality (16) holds with $\kappa_2 = 0$ when $\ell$ is convex and
twice-differentiable, but the inequality also holds for nonconvex losses such as the Tukey and Cauchy
losses with $\kappa_2 > 0$. By a more careful argument, we may eschew the condition (16) if $\ell$ is a
convex function that is in $C^1$ but not $C^2$, as in the case of the Huber loss, since Theorem 1
only requires first-order differentiability of $\mathcal{L}_n$ and $q_\lambda$; however, we state the propositions with
Assumption 3 for the sake of simplicity.
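
To make Assumption 3 concrete, the following sketch evaluates $\ell'$ and $\ell''$ on a grid and reports numerical values of $\kappa_1$ and $\kappa_2$ for the Huber, Tukey, and Cauchy losses. The parametrizations used here (each with a tuning parameter `xi`) are the standard textbook forms and are assumed, not verified, to coincide with those of Section 2; the grid width and resolution are arbitrary illustrative choices.

```python
def huber_d1(u, xi):
    """Derivative of the Huber loss: u clipped to [-xi, xi]."""
    return max(-xi, min(xi, u))

def huber_d2(u, xi):
    """Second derivative of the Huber loss (defined away from |u| = xi)."""
    return 1.0 if abs(u) <= xi else 0.0

def tukey_d1(u, xi):
    """Derivative of Tukey's biweight loss; vanishes for |u| > xi."""
    return u * (1.0 - (u / xi) ** 2) ** 2 if abs(u) <= xi else 0.0

def tukey_d2(u, xi):
    s = (u / xi) ** 2
    return (1.0 - s) * (1.0 - 5.0 * s) if abs(u) <= xi else 0.0

def cauchy_d1(u, xi):
    """Derivative of the Cauchy loss (xi^2 / 2) * log(1 + u^2 / xi^2)."""
    return u / (1.0 + (u / xi) ** 2)

def cauchy_d2(u, xi):
    s = (u / xi) ** 2
    return (1.0 - s) / (1.0 + s) ** 2

def assumption3_constants(d1, d2, xi, width=10.0, steps=20000):
    """Grid approximations of kappa_1 = sup |l'| and kappa_2 = max(0, -inf l'')."""
    grid = [(-width + 2.0 * width * i / steps) * xi for i in range(steps + 1)]
    kappa1 = max(abs(d1(u, xi)) for u in grid)
    kappa2 = max(0.0, -min(d2(u, xi) for u in grid))
    return kappa1, kappa2

xi = 1.5  # hypothetical tuning parameter
losses = {
    "huber": (huber_d1, huber_d2),
    "tukey": (tukey_d1, tukey_d2),
    "cauchy": (cauchy_d1, cauchy_d2),
}
constants = {name: assumption3_constants(d1, d2, xi) for name, (d1, d2) in losses.items()}
for name, (kappa1, kappa2) in constants.items():
    print(name, kappa1, kappa2)
```

Under these assumed forms, the grid values agree with the closed-form constants $\kappa_1 = \xi$, $\frac{16\xi}{25\sqrt{5}}$, $\frac{\xi}{2}$ and $\kappa_2 = 0$, $\frac{4}{5}$, $\frac{1}{8}$ for the Huber, Tukey, and Cauchy losses, respectively, so that $\kappa_1$ scales linearly in $\xi$ in each case.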

We have the following proposition, which establishes the gradient bound (14) with high 
probability under fairly mild assumptions: 

Proposition 1. Suppose $\ell$ satisfies the bounded derivative condition (15) and the following
conditions also hold:

(1) $w(x_i)x_i$ is sub-Gaussian with parameter $\sigma_w^2$.

(2) Either

(a) $v(x_i) = 1$ and $E[w(x_i)x_i] = 0$, or

(b) $E\left[\ell'(\epsilon_i \cdot v(x_i)) \mid x_i\right] = 0$.

With probability at least $1 - c_1\exp(-c_2 \log p)$, the loss function defined by equation (10) satisfies
the bound

$$\|\nabla \mathcal{L}_n(\beta^*)\|_\infty \le c\,\kappa_1 \sigma_w \sqrt{\frac{\log p}{n}}.$$

The proof of Proposition 1 is a simple but important application of sub-Gaussian tail bounds 
and is provided in Appendix C.1.

Remark: Note that for the unweighted M-estimator (4), conditions (1) and (2a) of Proposition 1
hold when $x_i$ is sub-Gaussian and $E[x_i] = 0$. If the $x_i$'s are not sub-Gaussian, condition
(1) nonetheless holds whenever $w(x_i)x_i$ is bounded. Furthermore, condition (2b) holds whenever
$\epsilon_i$ has a symmetric distribution and $\ell'$ is an odd function. We further highlight the fact
that aside from a possible mild requirement of symmetry, the concentration result given in
Proposition 1 is independent of the distribution of $\epsilon_i$, and holds equally well for heavy-tailed
error distributions. The distributional effect of the $x_i$'s is captured in the sub-Gaussian
parameter $\sigma_w$; in settings where the contaminated data still follow a sub-Gaussian distribution,
but the sub-Gaussian parameter is inflated due to large leverage points, using a weight function
as defined in equation (9) may lead to a significant decrease in the value of $\sigma_w$. This
decreases the finite-sample bias of the overall estimator.
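
The claim that the gradient bound is insensitive to heavy tails in the errors can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the analysis: it takes the unweighted loss with a Huber derivative (standard form, hypothetical tuning parameter `xi = 1.345`), standard normal covariates, and Cauchy errors with scale 0.1, and compares $\|\nabla \mathcal{L}_n(\beta^*)\|_\infty$ with the reference quantity $\kappa_1\sqrt{\log p / n}$, where $\kappa_1 = \xi$. The sample size, dimension, and seed are arbitrary.

```python
import math
import random

def huber_d1(u, xi):
    """Derivative of the Huber loss: u clipped to [-xi, xi]."""
    return max(-xi, min(xi, u))

def cauchy_noise(rng, scale):
    """Draw one Cauchy(0, scale) variate by inverse-CDF sampling."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))

def gradient_sup_norm(n, p, xi, scale, rng):
    """Sup-norm of (1/n) * sum_i l'(x_i^T beta* - y_i) x_i, using x_i^T beta* - y_i = -eps_i."""
    grad = [0.0] * p
    for _ in range(n):
        psi = huber_d1(-cauchy_noise(rng, scale), xi)
        for j in range(p):
            grad[j] += psi * rng.gauss(0.0, 1.0) / n
    return max(abs(g) for g in grad)

rng = random.Random(0)
n, p, xi = 400, 50, 1.345
observed = gradient_sup_norm(n, p, xi, scale=0.1, rng=rng)
reference = xi * math.sqrt(math.log(p) / n)  # kappa_1 * sqrt(log p / n)
print(observed, reference)
```

Because $|\ell'| \le \xi$, each coordinate of the gradient is an average of bounded multiples of Gaussian covariates, so the observed sup-norm stays within a constant multiple of the reference value even though the errors have no finite mean.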

Establishing the local RSC condition in Assumption 2 is more subtle, and the propositions
described below depend in a more complex fashion on the distribution of the $\epsilon_i$'s. As noted
above, the statistical consistency result in Theorem 1 only requires Assumption 2 to hold
when $\beta_2 = \beta^*$. However, for the stronger oracle result of Theorem 2, we will require the full
form of Assumption 2 to hold over all pairs $(\beta_1, \beta_2)$ in the local region. We will quantify the
parameters of the RSC condition in terms of an additional parameter $T > 0$, which is treated
as a fixed constant. Define the tail probability

$$\epsilon_T := P\left(|\epsilon_i| \ge \frac{T}{2}\right), \tag{17}$$

and the lower-curvature bound

$$\alpha_T := \min_{|u| \le T} \ell''(u) > 0, \tag{18}$$

where $\ell''$ is assumed to exist on the interval $[-T, T]$. We assume that $T$ is chosen small enough
so that $\alpha_T > 0$.
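
The two quantities in (17) and (18) pull in opposite directions as $T$ varies, which the following sketch illustrates. It assumes Cauchy errors with scale parameter `scale` (for which the tail probability has a closed form) and the standard form of the Cauchy loss with a hypothetical tuning parameter `xi`; the helper names are illustrative and not taken from the paper.

```python
import math

def eps_T_cauchy(T, scale):
    """Tail probability P(|eps| >= T/2) for Cauchy(0, scale) errors, in closed form."""
    return 1.0 - (2.0 / math.pi) * math.atan(T / (2.0 * scale))

def cauchy_loss_d2(u, xi):
    """Second derivative of the Cauchy loss (xi^2 / 2) * log(1 + u^2 / xi^2)."""
    s = (u / xi) ** 2
    return (1.0 - s) / (1.0 + s) ** 2

def alpha_T(d2, T, xi, steps=2000):
    """Grid approximation of min over |u| <= T of l''(u)."""
    return min(d2(-T + 2.0 * T * i / steps, xi) for i in range(steps + 1))

xi, scale = 1.0, 0.1
for T in (0.2, 0.5, 0.9, 1.2):
    print(T, eps_T_cauchy(T, scale), alpha_T(cauchy_loss_d2, T, xi))
```

Increasing $T$ shrinks the tail probability $\epsilon_T$ but also shrinks $\alpha_T$, which becomes negative once $T$ exceeds $\xi$ for this loss; this is why $T$ must be chosen small enough to keep $\alpha_T > 0$.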

We first consider the case where the loss function takes the usual form of an unweighted 
M-estimator (4). We have the following proposition, proved in Appendix C.2: 


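Proposition 2. Suppose the $x_i$'s are drawn from a sub-Gaussian distribution with parameter
$\sigma_x^2$ and the loss function is defined by equation (4). Also suppose the bound

$$c\,\sigma_x^2\left(\epsilon_T^{1/2} + \exp\left(-\frac{c'T^2}{\sigma_x^2 r^2}\right)\right) \le \frac{\alpha_T}{\alpha_T + \kappa_2} \cdot \frac{\lambda_{\min}(\Sigma_x)}{2} \tag{19}$$

holds. Suppose $\ell$ satisfies Assumption 3, and suppose the sample size satisfies $n \ge c_0 k\log p$.
With probability at least $1 - c\exp(-c'\log p)$, the loss function $\mathcal{L}_n$ satisfies Assumption 2 with

$$\alpha = \alpha_T \cdot \frac{\lambda_{\min}(\Sigma_x)}{16}, \qquad \text{and} \qquad \tau = \frac{C(\alpha_T + \kappa_2)^2 \sigma_x^2 T^2}{r^2}.$$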


Remark: Note that for a fixed value of $T$, inequality (19) places a tail condition on the
distribution of $\epsilon_i$ via the term $\epsilon_T$. This may be interpreted as a bound on the variance of the
error distribution when $\epsilon_i$ is sub-Gaussian, or a bound on the fraction of outliers when $\epsilon_i$ has
a contaminated distribution. Furthermore, the exponential term decreases as a function of
the ratio $\frac{T}{r}$. Hence, for a larger value of $\epsilon_T$, the radius $r$ will need to be smaller in order to
satisfy the bound (19). This agrees with the intuition that the local basin of good behavior for
the M-estimator is smaller for larger levels of contamination. Finally, note that although $\alpha_T$
and $\kappa_2$ are deterministic functions of the known regression function $\ell$ and could be computed,
the values of $\lambda_{\min}(\Sigma_x)$ and $\sigma_x^2$ are usually unknown a priori. Hence, Proposition 2 should be
viewed as more of a qualitative result describing the behavior of the RSC parameters as the
amount of contamination of the error distribution increases, rather than a bound that can be
used to select a suitable robust loss function.


The situation where $\mathcal{L}_n$ takes the form of a generalized M-estimator (10) is more difficult to
analyze in its most general form, so we will instead focus on verifying the RSC condition (13)
for the Mallows and Hill-Ryan estimators described in Section 2.2. We will show that the
RSC condition holds under weaker conditions on the distribution of the $x_i$'s. We have the
following propositions, proved in Appendices C.3 and C.4:

Proposition 3 (Mallows estimator). Suppose the $x_i$'s are drawn from a sub-exponential
distribution with parameter $\sigma_x^2$ and the loss function is defined by

$$\mathcal{L}_n(\beta) = \frac{1}{n}\sum_{i=1}^n w(x_i)\,\ell(x_i^T\beta - y_i),$$

and $w(x_i) = \min\left\{1, \frac{b}{\|Bx_i\|_2}\right\}$. Also suppose the bound

cb III B ~ l III 2 a x ^ e T 2 + ex P - 2 (ar'd- n 2 ) ' Amin ^ 

holds. Suppose $\ell$ satisfies Assumption 3, and suppose the sample size satisfies $n \ge c_0 k\log p$.
With probability at least $1 - c\exp(-c'\log p)$, the loss function $\mathcal{L}_n$ satisfies Assumption 2 with

Amin (E [w(xi)xixj ]) C(oi T + k 2 ) 2 ct:(!T 2 

a = ay --—-, and r = -=-. 

16 r z 

Proposition 4 (Hill-Ryan estimator). Suppose the loss function is defined by

$$\mathcal{L}_n(\beta) = \frac{1}{n}\sum_{i=1}^n w(x_i)\,\ell\big((x_i^T\beta - y_i)\,w(x_i)\big),$$

where $w(x_i) = \min\left\{1, \frac{b}{\|Bx_i\|_2}\right\}$. Also suppose the bound

$$c\,b^2 |||B^{-1}|||_2^2 \left(\epsilon_T^{1/2} + \exp\left(-\frac{c'T^2}{b^2 |||B^{-1}|||_2^2\, r^2}\right)\right) \le \frac{\alpha_T}{2(\alpha_T + \kappa_2)} \cdot \lambda_{\min}\left(E\left[w(x_i)x_ix_i^T\right]\right) \tag{20}$$

holds. Suppose $\ell$ satisfies Assumption 3, and suppose the sample size satisfies $n \ge c_0 k\log p$.
With probability at least $1 - c\exp(-c'\log p)$, the loss function $\mathcal{L}_n$ satisfies Assumption 2 with

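$$\alpha = \alpha_T \cdot \frac{\lambda_{\min}\left(E\left[w(x_i)x_ix_i^T\right]\right)}{16}, \qquad \text{and} \qquad \tau = \frac{C(\alpha_T + \kappa_2)^2\, b^2 |||B^{-1}|||_2^2\, T^2}{r^2}.$$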


Remark: Due to the presence of the weighting function $w(x_i)$, Proposition 3 imposes weaker
distributional requirements on the $x_i$'s than Proposition 2, and the requirements imposed
in Proposition 4 are still weaker. In fact, a version of Proposition 3 could be derived with
$w(x_i) = \min\left\{1, \frac{b}{\|Bx_i\|_2^2}\right\}$, which would not require the $x_i$'s to be sub-exponential. The tradeoff
in comparing Proposition 4 to Propositions 2 and 3 is that although the RSC condition holds
under weaker distributional assumptions on the covariates, the absolute bound $b^2 |||B^{-1}|||_2^2$
used in place of the sub-Gaussian/exponential parameter $\sigma_x^2$ may be much larger. Hence, the
relative size of $\epsilon_T$ and the radius $r$ will need to be smaller in order for inequality (20) to be
satisfied, relative to the requirement for inequality (19).
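
The effect of the weighting function is easy to see on a single leverage point. The sketch below implements $w(x) = \min\{1, b/\|Bx\|_2\}$ with $B$ defaulting to the identity; the function name and the example vectors are hypothetical, and the value $b = 3$ is borrowed from the simulation settings of Section 5.

```python
import math

def mallows_weight(x, b, B=None):
    """w(x) = min{1, b / ||B x||_2}; B is a list of rows and defaults to the identity."""
    Bx = x if B is None else [sum(r * v for r, v in zip(row, x)) for row in B]
    norm = math.sqrt(sum(v * v for v in Bx))
    return 1.0 if norm <= b else b / norm

b = 3.0
x_typical = [0.5, -0.2, 0.1]     # norm below b: left untouched
x_leverage = [30.0, -40.0, 0.0]  # norm 50: downweighted

w_typical = mallows_weight(x_typical, b)
w_leverage = mallows_weight(x_leverage, b)
weighted_norm = math.sqrt(sum((w_leverage * v) ** 2 for v in x_leverage))
print(w_typical, w_leverage, weighted_norm)
```

With $B = I_p$, the weighted covariate $w(x_i)x_i$ always has Euclidean norm at most $b$, which is the boundedness that replaces a tail assumption on the $x_i$'s.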


In Section 5 below, we explore the consequences of Propositions 1-4 for heavy-tailed, 
outlier, and sub-exponential distributions. 


3.3 Oracle results and asymptotic normality 

As discussed in the preceding two subsections, penalized robust M-estimators produce local
stationary points that enjoy $\ell_1$- and $\ell_2$-consistency whenever $\ell'$ is bounded and the errors
and covariates satisfy suitable mild assumptions. In fact, a distinguishing aspect of different
robust regression loss functions $\ell$ lies not in first-order comparisons, but in second-order
considerations concerning the variance of the estimator. This is a well-known concept in
classical robust regression analysis, where $p$ is fixed, $n \to \infty$, and the objective function does
not contain a penalty term. By the Cramér-Rao bound and under fairly general regularity
conditions [32], the optimal choice of $\ell$ that minimizes the asymptotic variance in the
low-dimensional setting is the MLE function, $\ell(u) = -\log p_\epsilon(u)$, where $p_\epsilon$ is the probability
density function of $\epsilon_i$. When the class of regression functions is constrained to those with
bounded influence functions (or bounded values of $\ell'$), however, a more complex analysis
reveals that choices of $\ell$ corresponding, e.g., to the losses introduced in Section 2.2 produce
better performance [30].

In this section, we establish oracle properties of penalized robust M-estimators. Our main 
result shows that under many of the assumptions stated earlier, local stationary points of the 
regularized M-estimators actually agree with the local oracle result, defined by

$$\hat{\beta}^{\mathcal{O}}_S := \arg\min_{\beta \in \mathbb{R}^S : \|\beta - \beta^*\|_2 \le r} \left\{\mathcal{L}_n(\beta)\right\}. \tag{21}$$

This is particularly attractive from a theoretical standpoint, because the oracle result implies
that local stationary points inherit all the properties of the lower-dimensional oracle estimator
$\hat{\beta}^{\mathcal{O}}_S$, which is covered by previous theory.

Note that $\hat{\beta}^{\mathcal{O}}_S$ is truly an oracle estimator, since it requires knowledge of both the actual
support set $S$ of $\beta^*$ and of $\beta^*$ itself; the optimization of the loss function is taken only over a
small neighborhood around $\beta^*$. In cases where $\mathcal{L}_n$ is convex or global optima of $\mathcal{L}_n$ that are
supported on $S$ lie in the ball of radius $r$ centered around $\beta^*$, the constraint $\|\beta - \beta^*\|_2 \le r$ may
be omitted. If $\mathcal{L}_n$ satisfies a local RSC condition (13), the oracle program (21) is guaranteed
to be convex, as stated in the following simple lemma, proved in Appendix E.1:

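Lemma 1. Suppose $\mathcal{L}_n$ satisfies the local RSC condition (13) and $n \ge \frac{2\tau}{\alpha} k\log p$. Then $\mathcal{L}_n$ is
strongly convex over the region $S_r := \{\beta \in \mathbb{R}^p : \mathrm{supp}(\beta) \subseteq S, \; \|\beta - \beta^*\|_2 \le r\}$.

In particular, the oracle estimator $\hat{\beta}^{\mathcal{O}}_S$ is guaranteed to be unique.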

Our central result of this section shows that when the regularizer is $(\mu, \gamma)$-amenable and
the loss function satisfies the local RSC condition in Assumption 2, stationary points of the
M-estimator (2) within the local neighborhood of $\beta^*$ are in fact unique and equal to the oracle
estimator (21). We also require a beta-min condition on the minimum signal strength, which
we denote by $\beta^*_{\min} := \min_{j \in S} |\beta^*_j|$. For simplicity, we state the theorem as a probabilistic
result for sub-Gaussian covariates and the unweighted M-estimator (4); similar results could
be derived for generalized M-estimators under weaker distributional assumptions, as well.

Theorem 2. Suppose the loss function $\mathcal{L}_n$ is given by the M-estimator (4) and is twice
differentiable in the $\ell_2$-ball of radius $r$ around $\beta^*$. Suppose the regularizer $\rho_\lambda$ is $(\mu, \gamma)$-amenable.
Under the same conditions of Theorem 1, suppose in addition that ||/3*||i < ^ and 

and > C\j 7 A. Suppose the sample size satisfies n > cq maxjfe 2 , k logp}. With 

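probability at least $1 - c\exp(-c'\min\{k, \log p\})$, any stationary point $\tilde{\beta}$ of the program (2) such
that $\|\tilde{\beta} - \beta^*\|_2 \le r$ satisfies $\mathrm{supp}(\tilde{\beta}) \subseteq S$ and $\tilde{\beta}_S = \hat{\beta}^{\mathcal{O}}_S$.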

The proof of Theorem 2 builds upon the machinery developed in the recent paper [36]. 
However, the argument here is slightly simpler, because we only need to prove the oracle 
result for stationary points within a radius $r$ of $\beta^*$. For completeness, we include a proof
of Theorem 2 in Section B.2, highlighting the modifications that are necessary to obtain the 
statement in the present paper. 

Remark: Several other papers [17, 10, 33] have established oracle results of a similar flavor, 
but only in cases where the M-estimator takes the form described in Section 2.1 and the loss 
function is convex. Furthermore, the results of previous authors only concern global optima 
and/or guarantee the existence of local optima with the desired oracle properties. Hence, 
our conclusions are at once more general and more complex, since we need a more careful 
treatment of possible local optima. 

In fact, since the oracle program (21) is essentially a $k$-dimensional optimization problem,
Theorem 2 allows us to apply previous results in the literature concerning the asymptotic
behavior of low-dimensional M-estimators to simultaneously analyze the asymptotic
distribution of $\hat{\beta}^{\mathcal{O}}_S$ and $\tilde{\beta}$. Huber [29] studied asymptotic properties of M-estimators when the loss
function is convex, and established asymptotic normality assuming $\frac{p^3}{n} \to 0$, a result which was
improved upon by Yohai and Maronna [67]. Portnoy [50] and Mammen [39] extended these
results to nonconvex M-estimators. Fewer results exist concerning generalized M-estimators:
Bai and Wu [4] and He and Shao [24] established asymptotic normality for a fairly general
class of estimators, but the assumption is that $p$ is fixed and $n \to \infty$. He and Shao [25]
extended their results to the case where $p$ is also allowed to grow and proved asymptotic
normality when $\frac{p^2 \log p}{n} \to 0$, assuming a convex loss.

Although the overall M-estimator may be highly nonconvex, the restricted program (21) 
defining the oracle estimator is nonetheless convex (cf. Lemma 1 above). Hence, the standard 
convex theory for M-estimators with a diverging number of parameters applies without
modification. Since the regularity conditions existing in the literature that guarantee asymptotic
normality vary substantially depending on the form of the loss function, we only provide a 
sample corollary for a specific (unweighted) case, as an illustration of the types of results on 
asymptotic normality that may be derived from Theorem 2. 


Corollary 1. Suppose the loss function $\mathcal{L}_n$ is given by the M-estimator (4) and the regularizer
$\rho_\lambda$ is $(\mu, \gamma)$-amenable. Under the same conditions of Theorem 2, suppose in addition that
I € C 3 , E€ (0, 00 ), and k > Clogn. Let /3 be any stationary point of the program (2) 

such that ||/3 - /3 *|| 2 < r. If -> 0 , then ||/3 -/?*|| 2 = O p (j/$\ U ! ^r Jl -> 0, then 

for any v € M p , we have 

^•u T cs-/n Aiv( 0,1), 

(j v 


where 


0 - 1 -.= 


E [*"(*)]-E (*'(e0) : 


T • V 


X T X 


n 


v. 


The proof of Corollary 1 is provided in Appendix D. Analogous results may be derived for 
other loss functions considered in this paper under slightly different regularity assumptions, by 
modifying appropriate low-dimensional results with diverging dimensionality (e.g., [50, 39]). 


4 Optimization 

We now discuss how our statistical theory gives rise to a useful two-step algorithm for
optimizing the resulting high-dimensional M-estimators. We first present some theory for the
composite gradient descent algorithm, including rates of convergence for the regularized problem.
We then describe our new two-step algorithm, which is guaranteed to converge to a stationary 
point within the local region where the RSC condition holds, even when the M-estimator is 
nonconvex. 


4.1 Composite gradient descent 

In order to obtain stationary points of the program (2), we use the composite gradient descent 
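algorithm [47]. Denoting $\bar{\mathcal{L}}_n(\beta) := \mathcal{L}_n(\beta) - q_\lambda(\beta)$, we may rewrite the program as

$$\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \left\{\bar{\mathcal{L}}_n(\beta) + \lambda\|\beta\|_1\right\}.$$

Then the composite gradient iterates are given by

$$\beta^{t+1} \in \arg\min_{\|\beta\|_1 \le R} \left\{ \frac{1}{2}\left\|\beta - \left(\beta^t - \frac{\nabla\bar{\mathcal{L}}_n(\beta^t)}{\eta}\right)\right\|_2^2 + \frac{\lambda}{\eta}\|\beta\|_1 \right\}, \tag{22}$$

where $\eta$ is the stepsize parameter. Defining the soft-thresholding operator $S_{\lambda/\eta}(\beta)$
componentwise according to

$$S_{\lambda/\eta}(\beta)_j := \mathrm{sign}(\beta_j)\left(|\beta_j| - \frac{\lambda}{\eta}\right)_+,$$

a simple calculation shows that the iterates (22) take the form

$$\beta^{t+1} = S_{\lambda/\eta}\left(\beta^t - \frac{\nabla\bar{\mathcal{L}}_n(\beta^t)}{\eta}\right). \tag{23}$$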
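
A single iteration of the soft-thresholded update can be sketched in a few lines. The function names below are illustrative, the gradient is passed in as a precomputed list, and the side constraint $\|\beta\|_1 \le R$ is assumed to be inactive, so no projection step is included.

```python
def soft_threshold(beta, thr):
    """Componentwise soft-thresholding: sign(b) * max(|b| - thr, 0)."""
    out = []
    for b in beta:
        if b > thr:
            out.append(b - thr)
        elif b < -thr:
            out.append(b + thr)
        else:
            out.append(0.0)
    return out

def composite_gradient_step(beta, grad, lam, eta):
    """One update: take a gradient step of size 1/eta, then soft-threshold at lam/eta."""
    shifted = [b - g / eta for b, g in zip(beta, grad)]
    return soft_threshold(shifted, lam / eta)

# Example: the first coordinate moves and is shrunk, the second is set to zero.
beta_next = composite_gradient_step([1.0, 0.0], [-2.0, 0.2], lam=0.5, eta=2.0)
print(beta_next)
```

Coordinates whose gradient step lands within $\lambda/\eta$ of zero are set exactly to zero, which is how the iterates become sparse.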

The following theorem guarantees that the composite gradient descent algorithm will
converge at a linear rate to a point near $\beta^*$ as long as the initial point $\beta^0$ is chosen close enough
to $\beta^*$. We will require the following assumptions on $\mathcal{L}_n$, where

$$\mathcal{T}'(\beta_1, \beta_2) := \mathcal{L}_n(\beta_1) - \mathcal{L}_n(\beta_2) - \langle \nabla\mathcal{L}_n(\beta_2), \beta_1 - \beta_2 \rangle$$

denotes the Taylor remainder.


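Assumption 4. Suppose $\mathcal{L}_n$ satisfies the restricted strong convexity condition

$$\mathcal{T}'(\beta_1, \beta_2) \ge \alpha'\|\beta_1 - \beta_2\|_2^2 - \tau'\,\frac{\log p}{n}\,\|\beta_1 - \beta_2\|_1^2 \tag{24}$$

for all $\beta_1, \beta_2 \in \mathbb{R}^p$ such that $\|\beta_1 - \beta^*\|_2, \|\beta_2 - \beta^*\|_2 \le r$. In addition, suppose $\mathcal{L}_n$ satisfies the
restricted smoothness condition

$$\mathcal{T}'(\beta_1, \beta_2) \le \alpha''\|\beta_1 - \beta_2\|_2^2 + \tau''\,\frac{\log p}{n}\,\|\beta_1 - \beta_2\|_1^2, \quad \forall \beta_1, \beta_2 \in \mathbb{R}^p. \tag{25}$$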

Note that the definition of $\mathcal{T}'$ differs slightly from the definition of the related Taylor
difference used in Assumption 2. However, one may verify the RSC condition (24) in exactly
the same way as we verify the RSC condition (13) via the mean value theorem argument of
Section 3.2, so we do not repeat the proofs here. The restricted smoothness condition (25)
is fairly mild and is easily seen to hold with $\tau'' = 0$ when the loss function $\ell$ appearing in
the definition of the M-estimator has a bounded second derivative. We will also assume for
simplicity that $q_\lambda$ is convex, as is the case for the SCAD and MCP regularizers; the theorem
may be extended to situations where $q_\lambda$ is nonconvex, given an appropriate quadratic bound
on the Taylor remainder of $q_\lambda$.

We have the following theorem, proved in Appendix B.3. It guarantees that as long as
the initial point $\beta^0$ of the composite gradient descent algorithm is chosen close enough to
$\beta^*$, the log of the $\ell_2$-error between iterates $\beta^t$ and a global minimizer $\hat{\beta}$ of the regularized
M-estimator (2) will decrease linearly with $t$ up to the order of the statistical error $\|\hat{\beta} - \beta^*\|_2$.


Theorem 3. Suppose $\mathcal{L}_n$ satisfies the RSC condition (24) and the RSM condition (25), and
suppose $\rho_\lambda$ is $\mu$-amenable with $\mu < 2\alpha'$ and $q_\lambda$ is convex. Suppose the regularization parameters
satisfy the scaling


C max < ||VT n (/3*)|| 0O r 


logp I 


< A < 


n 


C'a 
~ R r ‘ 


Also suppose $\hat{\beta}$ is a global optimum of the objective (2) over the set $\|\beta - \beta^*\|_2 \le \frac{r}{2}$. Suppose
$\eta \ge 2\alpha''$ and


n > 


4(2 t' + t") 


a' — p/2 


a' — p/2 + r]/2 a' — p/2 + rj/2 4 


■ R 2 log p. 


(26) 
If $\beta^0$ is chosen such that $\|\beta^0 - \beta^*\|_2 \le \frac{r}{2}$, successive iterates of the composite gradient descent
algorithm satisfy the bound

- Ml < U + - + CT^^WP - ml) , Vt > T*(S), 

la — pi \ t n ) 

where 6 2 > is a tolerance parameter, k € (0,1), and T*(<5) = - ^ • 

Remark: It is not obvious a priori that even if $\beta^0$ is chosen within a small constant radius
of $\beta^*$, successive iterates will also remain close by. Indeed, the hard work to establish this
fact is contained in the proof of Lemma 5 in Appendix B.3. Furthermore, note that we cannot
expect a global convergence guarantee to hold in general, since the only assumption on
$\mathcal{L}_n$ is the local version of RSC. Hence, a local convergence result such as the one stated in
Theorem 3 is the best we can hope for in this scenario.

In the simulations of Section 5, we see cases where initializing the composite gradient 
descent algorithm outside the local basin of attraction where the RSC condition holds causes 
iterates to converge to a stationary point outside the local region, and the resulting stationary 
point is not consistent for $\beta^*$. Hence, the assumption imposed in Theorem 3 concerning the
proximity of $\beta^0$ to $\beta^*$ is necessary in order to ensure good behavior of the optimization
trajectory for nonconvex robust estimators. 

4.2 Two-step estimators 

As discussed in Section 3 above, whereas different choices of the regression function $\ell$ with
bounded derivative yield estimators that are asymptotically unbiased and satisfy the same
$\ell_2$-bounds up to constant factors, certain M-estimators may be more desirable from the point
of view of asymptotic efficiency. When $\ell$ is nonconvex, we can no longer guarantee fast global
convergence of the composite gradient descent algorithm; indeed, the algorithm may even
converge to statistically inconsistent local optima. Nonetheless, Theorem 3 guarantees that
the composite gradient descent algorithm will converge quickly to a desirable stationary point
if the initial point is chosen within a constant radius of the true regression vector. We now
propose a new two-step algorithm, based on Theorem 3, that may be applied to optimize
high-dimensional robust M-estimators. Even when the regression function is nonconvex, our
algorithm will always converge to a stationary point that is statistically consistent for $\beta^*$.

Two-step procedure: 

(1) Run composite gradient descent using a convex regression function $\ell$ with convex
$\ell_1$-penalty, such that $\ell'$ is bounded.

(2) Use the output of step (1) to initialize composite gradient descent on the desired
high-dimensional M-estimator.

According to our results on statistical consistency (cf. Theorem 1), step (1) will produce
a global optimum $\hat{\beta}^1$ such that $\|\hat{\beta}^1 - \beta^*\|_2 \le c\sqrt{\frac{k \log p}{n}}$, as long as the regression function $\ell$
is chosen appropriately.$^2$ Under the scaling $n \ge Cr^2 \cdot k\log p$, we then have $\|\hat{\beta}^1 - \beta^*\|_2 \le r$.
Hence, by Theorem 3, composite gradient descent initialized at $\hat{\beta}^1$ in step (2) will converge to
a stationary point of the M-estimator at a linear rate. By our results of Section 3, the final
output $\hat{\beta}^2$ in step (2) is then statistically consistent and agrees with the local oracle estimator
if we use a $(\mu, \gamma)$-amenable penalty.

$^2$ The rate of convergence may be sublinear in the initial iterations [47], but we are still guaranteed to have
convergence.
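
The two-step procedure can be sketched as follows on a small synthetic problem. This is a minimal illustration under stated assumptions, not the implementation used for the simulations of Section 5: the Huber and Cauchy derivatives use standard textbook forms with hypothetical tuning parameters, the stepsize parameter `eta = 4.0` and the iteration counts are chosen by hand, both steps use the $\ell_1$-penalty, and the side constraint $\|\beta\|_1 \le R$ is assumed inactive and is not enforced.

```python
import math
import random

def huber_d1(u, xi=1.345):
    """Derivative of the (convex) Huber loss; bounded by xi."""
    return max(-xi, min(xi, u))

def cauchy_d1(u, xi=1.0):
    """Derivative of the (nonconvex) Cauchy loss; bounded by xi / 2."""
    return u / (1.0 + (u / xi) ** 2)

def soft_threshold(v, thr):
    """Componentwise soft-thresholding at level thr."""
    return [b - thr if b > thr else b + thr if b < -thr else 0.0 for b in v]

def loss_gradient(X, y, beta, d1):
    """Gradient of (1/n) * sum_i l(x_i^T beta - y_i) for a loss with derivative d1."""
    n = len(X)
    grad = [0.0] * len(beta)
    for row, yi in zip(X, y):
        psi = d1(sum(a * b for a, b in zip(row, beta)) - yi)
        for j, a in enumerate(row):
            grad[j] += psi * a / n
    return grad

def composite_gd(X, y, d1, lam, eta, beta0, iters):
    """Composite gradient descent with an l1-penalty (side constraint not enforced)."""
    beta = list(beta0)
    for _ in range(iters):
        grad = loss_gradient(X, y, beta, d1)
        beta = soft_threshold([b - g / eta for b, g in zip(beta, grad)], lam / eta)
    return beta

def l2_distance(a, b):
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(a, b)))

rng = random.Random(0)
n, p, k = 150, 20, 3
beta_star = [1.0 / math.sqrt(k)] * k + [0.0] * (p - k)
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
# Cauchy errors with scale 0.1, drawn by inverse-CDF sampling.
y = [sum(a * b for a, b in zip(row, beta_star))
     + 0.1 * math.tan(math.pi * (rng.random() - 0.5)) for row in X]

lam = 0.3 * math.sqrt(math.log(p) / n)
eta = 4.0

# Step (1): convex Huber loss, started from zero.
beta_1 = composite_gd(X, y, huber_d1, lam, eta, [0.0] * p, iters=150)
# Step (2): nonconvex Cauchy loss, initialized at the output of step (1).
beta_2 = composite_gd(X, y, cauchy_d1, lam, eta, beta_1, iters=150)

err_1 = l2_distance(beta_1, beta_star)
err_2 = l2_distance(beta_2, beta_star)
print(err_1, err_2)
```

The only coupling between the two steps is the initialization: step (2) reuses the same routine with a different derivative and starts from the output of step (1), which is what places its first iterate inside the local region required by Theorem 3.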

Remark: Our proposed two-step algorithm bears some similarity to classical algorithms
used for locating optima of robust regression estimators in low-dimensional settings. Recall
the notion of a one-step M-estimator [7], which is obtained by taking a single step of the
Newton-Raphson algorithm starting from a properly chosen initial point. Yohai [66] and
Simpson et al. [57] study asymptotic properties of one-step GM- and MM-estimators in the
setting where $p$ is fixed, and show that the resulting regression estimators may simultaneously
enjoy high-breakdown and high-efficiency properties. Welsh and Ronchetti [65] present
a finer-grained analysis of the asymptotic distribution and influence function of one-step
M-estimators as a function of the initialization point. Most directly related is the suggestion
of Hampel et al. [23] for optimizing redescending M-estimators using a one-step procedure
initialized using a least median of squares estimator, in order to overcome the problem of
nonconvexity and possibly multiple local optima; however, the method is mostly justified
heuristically. Although each step of our two-step method involves running a composite gradient
descent algorithm fully until convergence, the overall goal is still to produce an estimator
at the end of the second step that is more efficient and has better theoretical properties than
the solution of the first step alone.

The simulations in the next section demonstrate the efficacy of our two-step algorithm 
and the importance of step (1) in obtaining a proper initialization to the composite gradient
procedure in step (2).

5 Simulations 

In this section, we expound upon some concrete instances of our theoretical results and provide 
simulation results. Throughout, we generate i.i.d. data from the linear model 

$$y_i = x_i^T\beta^* + \epsilon_i, \quad \forall\, 1 \le i \le n.$$

5.1 Statistical consistency 

In the first set of simulations, we verify the $\ell_2$-consistency of high-dimensional robust
regression estimators when data are generated from various distributions.

We begin our discussion with a lemma that demonstrates the failure of the Lasso to achieve
the minimax $O\left(\sqrt{\frac{k \log p}{n}}\right)$ rate when the $\epsilon_i$'s are drawn from an $\alpha$-stable distribution with
$\alpha < 2$. Recall that a variable $X_0$ has an $\alpha$-stable distribution with scale parameter $\gamma$ if the
characteristic function of $X_0$ is given by

$$E[\exp(itX_0)] = \exp\left(-\gamma^\alpha |t|^\alpha\right), \quad \forall t \in \mathbb{R}, \tag{27}$$

and $\alpha \in (0, 2]$ [48]. In particular, the standard normal distribution is an $\alpha$-stable distribution
with $(\alpha, \gamma) = \left(2, \frac{1}{\sqrt{2}}\right)$, and the standard Cauchy distribution (also known as a $t$-distribution
with one degree of freedom) is an $\alpha$-stable distribution with $(\alpha, \gamma) = (1, 1)$. The lemma is
proved in Appendix E.2.
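
The two parameter pairs quoted above can be checked directly against the characteristic function $\exp(-\gamma^\alpha|t|^\alpha)$, using the well-known characteristic functions of the standard normal and standard Cauchy distributions (the symbols $Z$ and $C$ below are introduced only for this check):

```latex
% Standard normal Z: characteristic function exp(-t^2/2).
\mathbb{E}\left[e^{itZ}\right] = \exp\left(-\frac{t^2}{2}\right)
  = \exp\left(-\left(\frac{1}{\sqrt{2}}\right)^{2} |t|^{2}\right)
  \quad\Longrightarrow\quad (\alpha, \gamma) = \left(2, \frac{1}{\sqrt{2}}\right).

% Standard Cauchy C: characteristic function exp(-|t|).
\mathbb{E}\left[e^{itC}\right] = \exp\left(-|t|\right)
  = \exp\left(-1^{1}\,|t|^{1}\right)
  \quad\Longrightarrow\quad (\alpha, \gamma) = (1, 1).
```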

Lemma 2. Suppose $X$ is a sub-Gaussian matrix and $\epsilon$ is an i.i.d. vector of $\alpha$-stable random
variables with scale parameter 1. Suppose A x y^p. If a < 2 and log p = o we 

have 

> A ) > c a > 0, 

OO / 

where $c_\alpha < 1$ is a constant that depends only on the sub-Gaussian parameter of the rows of
$X$ and does not scale with the problem dimensions. In particular, if $\epsilon$ is an i.i.d. vector of
Cauchy random variables, the Lasso estimator is inconsistent.



In contrast, as established in Theorem 1 and the propositions of Section 3.2, replacing the
ordinary least squares loss by an appropriate robust loss function yields estimators that are
consistent at the usual $O\left(\sqrt{\frac{k \log p}{n}}\right)$ rate.


In our first set of simulations, we generated $\epsilon_i$'s from a Cauchy distribution with scale
parameter 0.1, and the $x_i$'s from a standard normal distribution. We ran simulations for
three problem sizes: $p = 128$, 256, and 512, with sparsity level $k \approx \sqrt{p}$. In each case, we
set $\beta^* = \left(\frac{1}{\sqrt{k}}, \dots, \frac{1}{\sqrt{k}}, 0, \dots, 0\right)$. Figure 1(a) shows the results when the loss function $\mathcal{L}_n$ is
equal to the Huber, Tukey, and Cauchy robust losses, and the regularizer is the $\ell_1$-penalty.
The estimator $\hat{\beta}$ was obtained using the composite gradient descent algorithm described in
Section 4.1 in the case of the Huber loss, and the two-step algorithm described in Section 4.2
in the cases of the Tukey and Cauchy losses, with the output of the Huber estimator used to
initialize the second step of the algorithm. In each case, we set the regularization parameters
$\lambda = 0.3\sqrt{\frac{\log p}{n}}$ and $R = 1.1\|\beta^*\|_1$, and averaged the results over 50 randomly generated data
sets. As shown in the figure, the $\ell_1$-penalized robust regression functions all yield statistically
consistent estimators. Furthermore, the curves for different problem sizes align when the
$\ell_2$-error is plotted against the rescaled sample size $\frac{n}{k \log p}$, agreeing with the theoretical bound in
Theorem 1.

We also ran a similar set of simulations when the $\epsilon_i$'s were generated from a mixture of
normals, representing a contaminated distribution with a constant fraction of outliers. With
probability 0.7, the value of $\epsilon_i$ was distributed according to $N(0, (0.1)^2)$, and was otherwise
drawn from a $N(0, 10^2)$ distribution. Figure 1(b) shows the results of the simulations. Again,
we see that the robust regression functions all give rise to statistically consistent estimators
with $\ell_2$-error scaling as $O\left(\sqrt{\frac{k \log p}{n}}\right)$. We also include the plots for the standard Lasso
estimator with the ordinary least squares objective. Since the distribution of $\epsilon_i$ is sub-Gaussian
for the mixture distribution, the Lasso estimator is also $\ell_2$-consistent; however, we see that
the robust loss functions improve upon the $\ell_2$-error of the Lasso by a constant factor.
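
The data-generating mechanisms for these two experiments can be sketched as follows. The helper names are hypothetical, the rounding rule `k = round(sqrt(p))` is one reading of $k \approx \sqrt{p}$, and the sample size is kept small purely for illustration.

```python
import math
import random

def cauchy_error(rng, scale=0.1):
    """Cauchy(0, scale) error, drawn by inverse-CDF sampling."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))

def mixture_error(rng, p_clean=0.7, sd_clean=0.1, sd_outlier=10.0):
    """N(0, sd_clean^2) with probability p_clean, otherwise N(0, sd_outlier^2)."""
    sd = sd_clean if rng.random() < p_clean else sd_outlier
    return rng.gauss(0.0, sd)

def make_beta_star(p, k):
    """k leading entries equal to 1/sqrt(k), remaining entries zero."""
    return [1.0 / math.sqrt(k)] * k + [0.0] * (p - k)

def simulate(n, p, k, error_fn, rng):
    """Draw (X, y) from the linear model with standard normal covariates."""
    beta_star = make_beta_star(p, k)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(a * b for a, b in zip(row, beta_star)) + error_fn(rng) for row in X]
    return X, y, beta_star

rng = random.Random(1)
p = 128
k = round(math.sqrt(p))
n = 50
X, y, beta_star = simulate(n, p, k, mixture_error, rng)

# Regularization parameters as set in the text.
lam = 0.3 * math.sqrt(math.log(p) / n)
R = 1.1 * sum(abs(b) for b in beta_star)
print(k, lam, R)
```

Swapping `mixture_error` for `cauchy_error` in the call to `simulate` gives the heavy-tailed setting of the first experiment; everything else is unchanged.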

Finally, we ran simulations to test the statistical consistency of generalized M-estimators
under relaxed distributional assumptions on the covariates. We generated $x_i$'s from a
sub-exponential distribution, given by independent chi-square variables with 10 degrees of freedom,
and recentered to have mean zero. The $\epsilon_i$'s were drawn from a Cauchy distribution with
scale parameter 0.1. We ran trials for problem sizes $p = 128$, 256, and 512, with $k \approx \sqrt{p}$
and $\beta^* = \left(\frac{1}{\sqrt{k}}, \dots, \frac{1}{\sqrt{k}}, 0, \dots, 0\right)$. We used the $\ell_1$-penalized Mallows estimator described in
Proposition 3, with $b = 3$, $B = I_p$, and $\ell$ equal to the Huber loss function, and optimized
the function using the composite gradient descent algorithm with random initializations, with
the regularization parameters $\lambda = 0.3\sqrt{\frac{\log p}{n}}$ and $R = 1.1\|\beta^*\|_1$. Figure 2 shows the result
of the simulations, from which we observe that the Mallows estimator is indeed statistically
consistent, as predicted by Theorem 1 and Proposition 3. We also plotted the results for
$\ell_1$-penalized Huber regression. It is not difficult to see from the proof of Theorem 1 that
$\|\nabla\mathcal{L}_n(\beta^*)\|_\infty$ is also of the order $O\left(\sqrt{\frac{\log p}{n}}\right)$ when the $x_i$'s are sub-exponential, but with
a larger prefactor than the Mallows loss. We observe in Figure 2 that the Huber loss indeed
appears to yield a statistically consistent estimator as well, but at a relatively slower rate. In
our simulations, we needed a slightly larger value of $\lambda$ for the Huber loss in order to
achieve statistical consistency.

Figure 1. Plots showing statistical consistency of $\ell_1$-penalized robust regression functions,
when the $x_i$'s are normally distributed but the $\epsilon_i$'s follow a heavy-tailed or normal mixture
distribution with a constant fraction of outliers. Each point represents an average over 50
trials. The $\ell_2$-error is plotted against the rescaled sample size $\frac{n}{k \log p}$. Curves correspond to
the Huber (solid), Tukey (dash-dotted), Cauchy (dotted), and ordinary least squares (dashed)
losses, and are color-coded according to the problem sizes $p = 128$ (red), 256 (black), and
512 (blue). (a) Plots for a heavy-tailed Cauchy error distribution. The Huber, Tukey, and
Cauchy robust losses all yield statistically consistent results, as predicted by Theorem 1 and
Propositions 1 and 2. (b) Plots for a mixture of normals error distribution with 30%
large-variance outliers. Since the error distribution is sub-Gaussian, the ordinary least squares loss
also yields a statistically consistent estimator at minimax rates, up to a constant prefactor;
however, the robust regression losses provide a significant improvement in the prefactor.

Figure 2. Plot showing simulation results for the $\ell_1$-penalized Mallows generalized
M-estimator with a Huber loss function, when covariates are drawn from a sub-exponential
distribution and errors are drawn from a heavy-tailed Cauchy distribution. Results for the
$\ell_1$-penalized Huber loss are shown for comparison. Each point represents an average over 50
trials. Although both estimators appear to be statistically consistent, the Mallows estimator
exhibits better performance. The plot agrees with the behavior predicted by Theorem 1 and
Proposition 3.


5.2 Convergence of optimization algorithm 

Next, we ran simulations to verify the convergence behavior of the composite gradient descent algorithm described in Section 4. We set $p = 128$, $k \approx \sqrt{p}$, and $n \approx 20 k \log p$, and generated the $\epsilon_i$'s from a Cauchy distribution with scale parameter 0.1 and the $x_i$'s from a standard normal distribution. We set $\beta^* = \big(\frac{1}{\sqrt{k}}, \dots, \frac{1}{\sqrt{k}}, 0, \dots, 0\big)$. We then simulated the solution paths for the Huber and Tukey loss functions with an $\ell_1$-penalty, with regularization parameters $\lambda = 0.3\sqrt{\frac{\log p}{n}}$ and $R = 1.1\|\beta^*\|_1$. Panel (a) of Figure 3 shows solution paths for the composite gradient descent algorithm with the Huber loss from 10 different starting points, chosen randomly from a $N(0, 6^2 I_p)$ distribution. An estimate of the global optimum $\widehat{\beta}$ was obtained from preliminary runs of the optimization algorithm, and the log optimization error $\log(\|\beta^t - \widehat{\beta}\|_2)$ for each of the initializations was computed accordingly. In addition, we plot the statistical error $\log(\|\widehat{\beta} - \beta^*\|_2)$ in red for comparison. As seen in the plot, the log errors decay roughly linearly in $t$. Since the $\ell_1$-penalized Huber objective is convex, our theory guarantees sublinear convergence of the iterates initially and then linear convergence locally around $\beta^*$ within the radius specified by Theorem 3. Indeed, our plots suggest nearly linear convergence even outside the local RSC region. All iterates converge to the unique global optimum $\widehat{\beta}$ (the apparent bifurcation is due to the small nonzero error tolerance provided in our implementation of the algorithm as a criterion for convergence).

Figure 3. Plots showing optimization trajectories for the composite gradient descent algorithm applied to various high-dimensional robust regression functions. The log of the $\ell_2$-error is plotted against the iteration number for a fixed instantiation of the data using the Huber and Tukey loss. The errors are generated from a heavy-tailed Cauchy distribution. Solution paths are shown in blue and measured with respect to $\beta^*$; the statistical error is shown for reference and plotted in red. (a) Solution paths for the $\ell_1$-penalized convex Huber loss with 10 random initializations. All iterates converge to a unique optimum $\widehat{\beta}$. Theorem 3 guarantees a rate of convergence that is linear on a log scale, once the iterates enter the region where the function satisfies restricted strong convexity. (b) Solution paths for the $\ell_1$-penalized nonconvex Tukey loss with 10 random initializations from the $\ell_1$-penalized Huber output (black); slight perturbations of $\beta^*$ within the local basin where the loss function satisfies restricted strong convexity (green); and random initializations (blue). The black and green trajectories all converge at a linear rate to a unique stationary point in the local region, as predicted by Theorem 3. The blue iterates converge at a slower rate to an entirely different stationary point. This figure emphasizes the need for proper initialization of the composite gradient algorithm in order to locate a statistically consistent stationary point.

Figure 3(b) shows solution paths using the $\ell_1$-penalized Tukey loss. We plot the composite gradient descent iterates for 10 different starting points chosen by the output of composite gradient descent applied to the $\ell_1$-penalized Huber loss (black) with random initializations; 10 randomly chosen starting points given by $\beta^*$ plus a $N(0, (0.1)^2 I_p)$ perturbation (green); and 10 randomly chosen starting points drawn from a $N(0, 3^2 I_p)$ distribution (blue). The simulation results reveal a linear rate of convergence for composite gradient descent iterates in the first two cases, as predicted by Theorem 3, since the initial iterates lie within the local region around $\beta^*$ where the Tukey loss satisfies the RSC condition. All of the black and green trajectories converge to the same unique stationary point in the local region. In the third case, however, the rate of convergence of composite gradient descent iterates is slower, and the iterates actually converge to a different stationary point further away from $\beta^*$. This emphasizes the cautionary message that stationary points may indeed exist for nonconvex robust regression functions that are not consistent for the true regression vector, and first-order optimization algorithms may converge to these undesirable stationary points if initialized improperly.
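The convex part of this experiment can be sketched in a few lines. The code below is a minimal illustration, not the implementation used for the figures: it runs composite gradient descent on the $\ell_1$-penalized Huber loss with the data-generating settings quoted above. The Huber threshold `delta = 1.345`, the step-size choice `eta` (the largest eigenvalue of $X^T X / n$), the zero initialization, and the omission of the side constraint $\|\beta\|_1 \le R$ are all assumptions made for simplicity; all function names are hypothetical.

```python
import numpy as np

def huber_loss(r, delta):
    """Huber loss evaluated elementwise on residuals r."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def huber_grad(X, y, beta, delta):
    """Gradient of L_n(beta) = (1/n) * sum_i huber(x_i^T beta - y_i)."""
    r = X @ beta - y
    # The derivative of the Huber loss is the residual clipped to [-delta, delta].
    return X.T @ np.clip(r, -delta, delta) / len(y)

def soft_threshold(z, thresh):
    """Proximal operator of thresh * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def objective(X, y, beta, lam, delta):
    return huber_loss(X @ beta - y, delta).mean() + lam * np.abs(beta).sum()

def composite_gradient_descent(X, y, lam, eta, beta0, delta=1.345, n_iter=500):
    """Gradient step on the loss followed by soft-thresholding (assumed variant:
    the l1 side constraint ||beta||_1 <= R is not enforced)."""
    beta = beta0.copy()
    path = [beta.copy()]
    for _ in range(n_iter):
        step = beta - huber_grad(X, y, beta, delta) / eta
        beta = soft_threshold(step, lam / eta)
        path.append(beta.copy())
    return np.array(path)

rng = np.random.default_rng(0)
p = 128
k = int(round(np.sqrt(p)))            # k ~ sqrt(p)
n = int(20 * k * np.log(p))           # n ~ 20 k log p
beta_star = np.zeros(p)
beta_star[:k] = 1.0 / np.sqrt(k)
X = rng.standard_normal((n, p))
y = X @ beta_star + 0.1 * rng.standard_cauchy(n)   # Cauchy errors, scale 0.1
lam = 0.3 * np.sqrt(np.log(p) / n)
eta = np.linalg.norm(X, 2) ** 2 / n   # upper bound on the curvature of L_n

path = composite_gradient_descent(X, y, lam, eta, beta0=np.zeros(p))
beta_hat = path[-1]                   # proxy for the global optimum
log_opt_err = np.log(np.linalg.norm(path[:30] - beta_hat, axis=1))
stat_err = np.linalg.norm(beta_hat - beta_star)
```

Plotting `log_opt_err` against the iteration index gives a rough analogue of the blue curves in panel (a), with `np.log(stat_err)` playing the role of the red reference line.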
5.3 Nonconvex regularization


Finally, we ran simulations to verify the oracle results described in Section 3.3. Figure 4 shows side-by-side comparisons for robust regression using the Huber and Cauchy loss functions with the SCAD penalty, with parameter $a = 2.5$. We ran simulations for $p = 128$, 256, and 512, with $k \approx \sqrt{p}$ and $\beta^* = \big(\frac{1}{\sqrt{k}}, \dots, \frac{1}{\sqrt{k}}, 0, \dots, 0\big)$. The $\epsilon_i$'s were drawn from a Cauchy distribution with scale parameter 0.1, and the $x_i$'s were drawn from a standard normal distribution. The $\ell_1$-penalized Huber loss was used to select an initial point for the composite gradient descent algorithm, as prescribed by the two-step algorithm; in all cases, we fixed the regularization parameter $\lambda$ and set $R = 1.1\|\beta^*\|_1$. Panel (a) plots the $\ell_2$-error versus the rescaled sample size $\frac{n}{k \log p}$, from which we see that both SCAD-penalized objective functions yield statistically consistent estimators. Panel (b) plots the fraction of trials (out of 50) for which the recovered support of the estimator agrees with the true support of $\beta^*$. As we see, the families of curves for different loss functions stack up when the horizontal axis is rescaled according to $\frac{n}{k \log p}$. Furthermore, the probability of correct support recovery transitions sharply from 0 to 1 in panel (b), as predicted by Theorem 2. Note that the transition point for the Cauchy loss in panel (b), which happens for $\frac{n}{k \log p} \approx 8$, also corresponds to a sharp drop in the $\ell_2$-error in panel (a), since $\widehat{\beta}$ is then equal to the low-dimensional oracle estimator. Panel (c) plots the empirical variance of $\sqrt{n} \cdot e_1^T(\widehat{\beta} - \beta^*)$, the first component of the error vector rescaled by $\sqrt{n}$. We see that the variance for the Cauchy loss is uniformly smaller than the variance for the Huber loss; indeed, the Cauchy loss corresponds to the MLE of the error distribution. Furthermore, the curves for each loss function roughly align for different problem sizes, and the variance is roughly constant for increasing $n$, as predicted by Corollary 1. Note that Corollary 1 requires third-order differentiability, so it does not directly address the Huber loss. However, the empirical variance of the Huber estimators is also roughly constant, suggesting that a version of Corollary 1 applicable to the Huber loss might be derived from the oracle results of Theorem 2.
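For readers who wish to reproduce the regularizer, the following sketch evaluates the SCAD penalty in its standard three-piece form, together with the function $q_\lambda(t) = \lambda|t| - \rho_\lambda(t)$ that appears in the proofs. The function names are hypothetical; the default $a = 2.5$ matches the value quoted above. The final lines numerically check that $\frac{\mu}{2}t^2 - q_\lambda(t)$ is convex for $\mu = \frac{1}{a-1}$, which is the curvature constant one obtains from the closed form of $q_\lambda$.

```python
import numpy as np

def scad_penalty(t, lam, a=2.5):
    """SCAD penalty rho_lambda(t): linear near zero, quadratic blend, then flat."""
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t
    quad = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    const = (a + 1) * lam**2 / 2
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))

def scad_q(t, lam, a=2.5):
    """q_lambda(t) = lam * |t| - rho_lambda(t), the smooth concave correction."""
    return lam * np.abs(t) - scad_penalty(t, lam, a)

# Numerical convexity check of (mu/2) t^2 - q_lambda(t) with mu = 1 / (a - 1).
a, lam = 2.5, 1.0
mu = 1.0 / (a - 1)
grid = np.linspace(-5.0, 5.0, 2001)
g = 0.5 * mu * grid**2 - scad_q(grid, lam, a)
second_diff = g[2:] - 2 * g[1:-1] + g[:-2]
```

Nonnegative second differences on the grid are consistent with convexity; they do not constitute a proof.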


6 Discussion 

We have studied penalized high-dimensional robust estimators for linear regression. Our results show that under a local RSC condition satisfied by many robust regression M-estimators, stationary points within the region of restricted curvature are actually statistically consistent estimators of the true regression vector, and even under heavy-tailed errors or outlier contamination, these estimators enjoy the same convergence rate as $\ell_1$-penalized least squares regression with sub-Gaussian errors. Furthermore, we show that when the penalty is chosen from an appropriate family of nonconvex, amenable regularizers, the stationary point within the local RSC region is unique and agrees with the local oracle solution. This allows us to establish asymptotic normality of local stationary points under appropriate regularity conditions, and in some cases conclude that the regularized M-estimator is asymptotically efficient. Finally, we propose a two-step M-estimation procedure for obtaining local stationary points when the M-estimator is nonconvex, where the first step consists of optimizing a convex problem to obtain a sufficiently close initialization for a final run of composite gradient descent in the second step.

Several open questions remain that provide interesting avenues for future work. First, although the side constraint $\|\beta\|_1 \le R$ in the regularized M-estimation program (2) is required in our proofs to ensure that stationary points obey a cone condition, it is unclear whether this side condition is necessary. Indeed, since we are only concerned with stationary points within a small radius $r$ of $\beta^*$, the additional $\ell_1$-constraint may be redundant. It would be useful to remove the appearance of $R$ for practical problems, since we would then only need to tune the parameter $\lambda$. Second, as a consequence of the oracle result in Theorem 2, local stationary points inherit other properties of the oracle solution $\widehat{\beta}^{O}$ in addition to asymptotic normality, such as breakdown behavior and properties of the influence function. It would be interesting to explore these properties for robust M-estimators with a diverging number of parameters. A potentially harder problem would be to derive bounds on the measures of robustness for stationary points of regularized robust estimators when the oracle result does not hold (i.e., for $\ell_1$-penalized robust M-estimators). Lastly, whereas our results on asymptotic normality allow us to draw conclusions regarding the asymptotic variance of the local oracle solution, it would be valuable to derive nonasymptotic bounds on the variance of high-dimensional robust M-estimators. By trading off the nonasymptotic bias and variance, one could then determine the form of a robust regression function that is optimal in some sense.

Figure 4. Plots showing simulation results for robust regression with a nonconvex SCAD regularizer, using a Huber loss (solid lines) and Cauchy loss (dashed lines), for three problem sizes: $p = 128$ (red), $p = 256$ (black), and $p = 512$ (blue). Each point represents an average over 50 trials. (a) Plot showing $\ell_2$-error as a function of the rescaled sample size $\frac{n}{k \log p}$. Both objective functions yield statistically consistent estimators, as predicted by Theorem 1. (b) Plot showing variable selection consistency. The probability of success in recovering the support transitions sharply from 0 to 1 as a function of the sample size, agreeing with the theoretical predictions of Theorem 2. The transition threshold corresponds with the sharp drop in $\ell_2$-error seen in panel (a), since $\widehat{\beta}$ agrees with the oracle result. (c) Plot showing the empirical variance of $\sqrt{n} \cdot e_1^T(\widehat{\beta} - \beta^*)$, the rescaled first component in the error vector. As predicted by the asymptotic normality result of Corollary 1, the empirical variance remains roughly constant for sufficiently large sample sizes.


A Measures of robustness 

Various methods exist in the classical literature for quantifying the robustness of statistical 
estimation procedures. In this section, we provide a review of breakdown points, influence 
functions, and asymptotic variance of robust estimators, and cite relevant literature. 

The finite-sample breakdown point of an estimator $T_n$ on the sample $\{x_i\}_{i=1}^n$ is defined by

$$\mathrm{FBP}_n(T; x_1, \dots, x_n) := \frac{1}{n} \cdot \min\Big\{ m : \max_{i_1, \dots, i_m} \sup_{y_1, \dots, y_m} |T_n(z_1, \dots, z_n)| = \infty \Big\},$$

where $(z_1, \dots, z_n)$ is the sample obtained from $(x_1, \dots, x_n)$ by replacing the data points $(x_{i_1}, \dots, x_{i_m})$ by $(y_1, \dots, y_m)$ [23, 12]. One may verify that the finite-sample breakdown point is $\frac{1}{n}$ for M-estimators of the type defined in Section 2.1 when $\ell$ is convex [ ]. This provides another reason to use nonconvex loss functions in order to obtain a robust estimator. Although the breakdown behavior of M-estimators is much harder to characterize when the loss function is nonconvex, Maronna et al. [40] derived theoretical results showing the breakdown point decays as $O(p^{-1/2})$ when the $x_i$'s are Gaussian. More recently, Wang et al. [64] analyzed the breakdown point of a certain nonconvex penalized M-estimator, but their analysis is again very specific to the precise form of the estimator and requires careful data-dependent tuning of the scale parameter used in the objective function. Under suitable regularity conditions, taking the limit of the finite-sample breakdown point as $n \to \infty$ yields the asymptotic breakdown point, but the latter concept is more technical and we do not discuss it here.
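The $\frac{1}{n}$ breakdown point can be seen concretely in the simplest convex case. The sketch below is an illustration only, using ordinary least squares for regression through the origin rather than any estimator from the paper; it replaces a single observation ($m = 1$) by increasingly extreme values and tracks the fitted slope. The sample size, true slope, and contamination values are arbitrary choices made for the example.

```python
import numpy as np

def ols_slope(x, y):
    """Least-squares slope for regression through the origin."""
    return float(x @ y / (x @ x))

rng = np.random.default_rng(1)
n = 100
x = rng.standard_normal(n)
y = 2.0 * x + 0.1 * rng.standard_normal(n)
clean_slope = ols_slope(x, y)

contaminated_slopes = []
for big in (1e1, 1e3, 1e5):
    xc, yc = x.copy(), y.copy()
    xc[0], yc[0] = 1.0, big          # replace one data point with (1, big)
    contaminated_slopes.append(ols_slope(xc, yc))
```

The contaminated slope grows linearly in `big`, so a single replaced point suffices to push the estimate arbitrarily far from the truth.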

A second measure of robustness is given by the influence function. At the population level, the influence function of an estimator $T$ on a distribution $F$ with respect to a point $(x, y)$ is defined by

$$\mathrm{IF}((x,y); T, F) := \lim_{t \to 0^+} \frac{T\big((1-t)F + t\delta_{(x,y)}\big) - T(F)}{t},$$

where $\delta_{(x,y)}$ is a point mass at $(x,y)$. The gross error sensitivity is defined in terms of the influence function as

$$\mathrm{GES}(T, F) := \sup_{(x,y)} \big|\mathrm{IF}((x,y); T, F)\big|,$$
and the estimator $T$ is B-robust if $\mathrm{GES}(T, F) < \infty$ [53]. In the linear regression case, let $F_\beta$ denote the distribution on $(x_i, y_i, \epsilon_i)$ parametrized by $\beta$. If $T_\ell$ minimizes the M-estimator defined in equation (4) and $\ell$ is twice differentiable, the influence function takes the form

$$\mathrm{IF}((x,y); T_\ell, F_\beta) = \ell'(x^T\beta - y) \cdot \Big(E\big[\ell''(x_i^T\beta - y_i) \cdot x_i x_i^T\big]\Big)^{-1} x, \qquad (28)$$

where the expectation is taken with respect to $F_\beta$ [23, Section 6.3]. In particular, if the $x_i$'s are fixed and contamination is only allowed in the $y_i$'s, the influence function in equation (28) is bounded as a function of $y$, provided $\ell'$ is bounded. For a generalized M-estimator $T_\eta$ defined by equation (7), the influence function is given by

$$\mathrm{IF}((x,y); T_\eta, F_\beta) = \eta(x, x^T\beta - y) \cdot \Big(E\Big[\frac{\partial \eta(x, r)}{\partial r}\Big|_{(x_i, r_i)} \cdot x_i x_i^T\Big]\Big)^{-1} x, \qquad (29)$$

where $r_i$ denotes the residual $x_i^T\beta - y_i$ and the expectation is taken with respect to $F_\beta$ [23]. In particular, if $\eta$ takes the form in equation (8), then equation (29) simplifies to

$$\mathrm{IF}((x,y); T_\eta, F_\beta) = w(x)\,\ell'\big((x^T\beta - y)v(x)\big) \cdot \Big(E\big[w(x_i)v(x_i) \cdot \ell''\big(r_i v(x_i)\big) \cdot x_i x_i^T\big]\Big)^{-1} x, \qquad (30)$$

and we see that the overall influence function is bounded whenever $\ell'$ is bounded and $w$ is defined in such a way that $\|w(x)x\|_2$ is bounded.
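The boundedness of $\ell'$ is easy to check numerically for the losses used in the simulations. The sketch below evaluates the derivative of each loss on a wide grid and records its largest absolute value. The parametrizations and tuning constants (`delta`, `c`, `xi`) are common textbook choices and are assumptions here, not values taken from the paper; the function names are hypothetical.

```python
import numpy as np

def psi_least_squares(u):
    """Derivative of u^2 / 2: unbounded."""
    return np.asarray(u, dtype=float)

def psi_huber(u, delta=1.345):
    """Derivative of the Huber loss: residual clipped to [-delta, delta]."""
    return np.clip(np.asarray(u, dtype=float), -delta, delta)

def psi_tukey(u, c=4.685):
    """Derivative of Tukey's biweight loss: redescends to zero beyond c."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= c, u * (1 - (u / c) ** 2) ** 2, 0.0)

def psi_cauchy(u, xi=2.385):
    """Derivative of (xi^2 / 2) * log(1 + u^2 / xi^2)."""
    u = np.asarray(u, dtype=float)
    return u / (1 + (u / xi) ** 2)

grid = np.linspace(-1e3, 1e3, 200001)
sup_abs = {
    "least_squares": float(np.max(np.abs(psi_least_squares(grid)))),
    "huber": float(np.max(np.abs(psi_huber(grid)))),
    "tukey": float(np.max(np.abs(psi_tukey(grid)))),
    "cauchy": float(np.max(np.abs(psi_cauchy(grid)))),
}
```

On this grid, only the least-squares derivative grows with the range of residuals; the other three stay below a constant determined by their tuning parameter.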

A finite-sample version of the influence function is known as the sensitivity curve, and under suitable regularity conditions, the sensitivity curve converges to the influence function as $n \to \infty$ [23]. The literature concerning influence functions for high-dimensional estimators is again rather sparse, but has been a topic of recent interest [43, 49].
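As a small illustration of the sensitivity curve, the sketch below uses the common definition $\mathrm{SC}_n(x) = n\,\big(T_n(x_1, \dots, x_{n-1}, x) - T_{n-1}(x_1, \dots, x_{n-1})\big)$ for a location estimator; this normalization is an assumption, since the text does not spell it out. It compares the sample mean with the sample median on simulated data; the function name and sample are hypothetical.

```python
import numpy as np

def sensitivity_curve(estimator, sample, x):
    """n * (T_n(sample with x appended) - T_{n-1}(sample))."""
    n = len(sample) + 1
    return n * (estimator(np.append(sample, x)) - estimator(sample))

rng = np.random.default_rng(2)
sample = rng.standard_normal(99)

# The mean's sensitivity curve is x - mean(sample): linear and unbounded in x.
sc_mean = [sensitivity_curve(np.mean, sample, x) for x in (1e2, 1e4, 1e6)]
# The median's sensitivity curve stops changing once x exceeds every sample point.
sc_median = [sensitivity_curve(np.median, sample, x) for x in (1e2, 1e4, 1e6)]
```

The contrast mirrors the population-level statement: an estimator with a bounded influence function has a sensitivity curve that levels off as the contaminating point moves away.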

Finally, we turn to second-order considerations. In the classical low-dimensional setting 
when p is fixed and n —>• oo, Maronna and Yohai [42] show that under appropriate regularity 
conditions, the asymptotic variance of an M-estimator is given by 


$$V(T, F) = \int \mathrm{IF}((x,y); T, F) \cdot \mathrm{IF}((x,y); T, F)^T \, dF(x,y).$$

By the celebrated Cramér-Rao bound [32], when the $x_i$'s are fixed and the $\epsilon_i$'s are i.i.d., the
asymptotic variance $V(T, F)$ of any unbiased estimator is bounded below by the inverse of the
Fisher information of the underlying distribution. Furthermore, this lower bound is achieved 
when T is the MLE, in which case T is also asymptotically normally distributed [21, 32]. As 
pointed out in the previous paragraph, however, the influence function of the MLE may not 
be bounded, leading to a critical tradeoff in designing robust M-estimators. In addition, the 
behavior of the asymptotic variance is much harder to analyze when both n and p are allowed 
to grow. Several recent papers [14, 5, 13] examine the setting where $\frac{n}{p} \to \delta \in (1, \infty)$, and
show that the asymptotic variance of the (unregularized) M-estimator coming from a convex 
loss function includes an additional term not present in the classical fixed-p case. In contrast, 
we show that with the proper choice of nonconvex penalty, local solutions of nonconvex 
regularized M-estimators coincide with the oracle solution, so they inherit certain optimality 
properties from classical robust estimation theory. It is these higher-order considerations that 
reveal the true advantage of using nonconvex loss functions for robust M-estimation; although
estimators such as the LAD Lasso [62, 63] may also be shown to be statistically consistent
under reasonable assumptions, the LAD loss is a suboptimal choice from the viewpoint of
asymptotic efficiency, under the high-dimensional scaling $n \ge Ck \log p$ and oracle conditions,
unless the additive errors follow a double exponential distribution.
B Proofs of main theorems

In this Appendix, we provide the proofs of the main theorems stated in the text of the paper. 

B.1 Proof of Theorem 1

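We first suppose the existence of stationary points in the local region; we will establish that fact at the end of the proof. Suppose $\widetilde{\beta}$ is a stationary point such that $\|\widetilde{\beta} - \beta^*\|_2 \le r$. Since $\widetilde{\beta}$ is a stationary point and $\beta^*$ is feasible, we have the inequality

$$\langle \nabla \mathcal{L}_n(\widetilde{\beta}) - \nabla q_\lambda(\widetilde{\beta}) + \lambda\,\mathrm{sign}(\widetilde{\beta}),\ \beta^* - \widetilde{\beta} \rangle \ge 0. \qquad (31)$$

By the convexity of $\frac{\mu}{2}\|\beta\|_2^2 - q_\lambda(\beta)$, we have

$$\langle \nabla q_\lambda(\widetilde{\beta}),\ \beta^* - \widetilde{\beta} \rangle \ge q_\lambda(\beta^*) - q_\lambda(\widetilde{\beta}) - \frac{\mu}{2}\|\widetilde{\beta} - \beta^*\|_2^2, \qquad (32)$$

so together with inequality (31), we have

$$\langle \nabla \mathcal{L}_n(\widetilde{\beta}) + \lambda\,\mathrm{sign}(\widetilde{\beta}),\ \beta^* - \widetilde{\beta} \rangle \ge q_\lambda(\beta^*) - q_\lambda(\widetilde{\beta}) - \frac{\mu}{2}\|\widetilde{\beta} - \beta^*\|_2^2.$$

Since $\langle \mathrm{sign}(\widetilde{\beta}),\ \beta^* - \widetilde{\beta} \rangle \le \|\beta^*\|_1 - \|\widetilde{\beta}\|_1$, this means

$$\langle \nabla \mathcal{L}_n(\widetilde{\beta}),\ \beta^* - \widetilde{\beta} \rangle \ge \rho_\lambda(\widetilde{\beta}) - \rho_\lambda(\beta^*) - \frac{\mu}{2}\|\widetilde{\beta} - \beta^*\|_2^2. \qquad (33)$$

Now denote $\widetilde{\nu} := \widetilde{\beta} - \beta^*$. From the RSC inequality (13), we have

$$\langle \nabla \mathcal{L}_n(\widetilde{\beta}) - \nabla \mathcal{L}_n(\beta^*),\ \widetilde{\beta} - \beta^* \rangle \ge \alpha\|\widetilde{\nu}\|_2^2 - \tau\frac{\log p}{n}\|\widetilde{\nu}\|_1^2. \qquad (34)$$

Combining inequality (34) with inequality (33), we then have

$$\Big(\alpha - \frac{\mu}{2}\Big)\|\widetilde{\nu}\|_2^2 - \tau\frac{\log p}{n}\|\widetilde{\nu}\|_1^2 + \big(\rho_\lambda(\widetilde{\beta}) - \rho_\lambda(\beta^*)\big) \le \langle \nabla \mathcal{L}_n(\beta^*),\ \beta^* - \widetilde{\beta} \rangle, \qquad (35)$$

so by Hölder's inequality, we conclude that

$$\Big(\alpha - \frac{\mu}{2}\Big)\|\widetilde{\nu}\|_2^2 - \tau\frac{\log p}{n}\|\widetilde{\nu}\|_1^2 + \big(\rho_\lambda(\widetilde{\beta}) - \rho_\lambda(\beta^*)\big) \le \|\nabla \mathcal{L}_n(\beta^*)\|_\infty \|\widetilde{\nu}\|_1. \qquad (36)$$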


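In particular, under the assumed scaling $\lambda \ge 4\|\nabla \mathcal{L}_n(\beta^*)\|_\infty$ and $\lambda \ge 8\tau R \frac{\log p}{n}$, and using the bound $\|\widetilde{\nu}\|_1 \le 2R$ implied by feasibility, we have

$$\Big(\alpha - \frac{\mu}{2}\Big)\|\widetilde{\nu}\|_2^2 \le \rho_\lambda(\beta^*) - \rho_\lambda(\widetilde{\beta}) + \frac{\lambda}{2}\|\widetilde{\nu}\|_1 \le \rho_\lambda(\beta^*) - \rho_\lambda(\widetilde{\beta}) + \frac{\rho_\lambda(\widetilde{\nu})}{2} + \frac{\mu}{4}\|\widetilde{\nu}\|_2^2,$$

where the second inequality uses the bound $\lambda\|\widetilde{\nu}\|_1 \le \rho_\lambda(\widetilde{\nu}) + \frac{\mu}{2}\|\widetilde{\nu}\|_2^2$, valid for $\mu$-amenable regularizers. By the subadditivity of $\rho_\lambda$, we have $\rho_\lambda(\widetilde{\nu}) \le \rho_\lambda(\widetilde{\beta}) + \rho_\lambda(\beta^*)$, implying that

$$\Big(\alpha - \frac{3\mu}{4}\Big)\|\widetilde{\nu}\|_2^2 \le \frac{3}{2}\rho_\lambda(\beta^*) - \frac{1}{2}\rho_\lambda(\widetilde{\beta}). \qquad (37)$$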
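By Lemma 5 in Loh and Wainwright [35], we then have

$$0 \le 3\rho_\lambda(\beta^*) - \rho_\lambda(\widetilde{\beta}) \le \lambda\big(3\|\widetilde{\nu}_A\|_1 - \|\widetilde{\nu}_{A^c}\|_1\big), \qquad (38)$$

where $A$ is the index set of the $k$ largest elements of $\widetilde{\nu}$ in magnitude. Hence,

$$\|\widetilde{\nu}_{A^c}\|_1 \le 3\|\widetilde{\nu}_A\|_1,$$

implying that

$$\|\widetilde{\nu}\|_1 = \|\widetilde{\nu}_A\|_1 + \|\widetilde{\nu}_{A^c}\|_1 \le 4\|\widetilde{\nu}_A\|_1 \le 4\sqrt{k}\,\|\widetilde{\nu}\|_2. \qquad (39)$$

Combining inequalities (37) and (38) then gives

$$\Big(\alpha - \frac{3\mu}{4}\Big)\|\widetilde{\nu}\|_2^2 \le \frac{3\lambda}{2}\|\widetilde{\nu}_A\|_1 - \frac{\lambda}{2}\|\widetilde{\nu}_{A^c}\|_1 \le \frac{3\lambda}{2}\|\widetilde{\nu}_A\|_1 \le 6\lambda\sqrt{k}\,\|\widetilde{\nu}\|_2,$$

from which we conclude that

$$\|\widetilde{\nu}\|_2 \le \frac{24\lambda\sqrt{k}}{4\alpha - 3\mu}, \qquad (40)$$

as wanted. Combining the $\ell_2$-bound with inequality (39) then yields the $\ell_1$-bound.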
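For completeness, the last step can be written out. Substituting the $\ell_2$-bound (40) into the cone inequality (39) gives the following, where the constant 96 is simply $4 \times 24$:

```latex
\|\widetilde{\nu}\|_1
  \;\le\; 4\sqrt{k}\,\|\widetilde{\nu}\|_2
  \;\le\; 4\sqrt{k}\cdot\frac{24\lambda\sqrt{k}}{4\alpha - 3\mu}
  \;=\; \frac{96\,\lambda k}{4\alpha - 3\mu}.
```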

Finally, in order to establish the existence of stationary points, we simply define $\widehat{\beta} \in \mathbb{R}^p$ such that

$$\widehat{\beta} \in \arg\min_{\|\beta - \beta^*\|_2 \le r,\ \|\beta\|_1 \le R} \big\{\mathcal{L}_n(\beta) + \rho_\lambda(\beta)\big\}. \qquad (41)$$

Then $\widehat{\beta}$ is a stationary point of the program (41), so by the argument just provided, we have

$$\|\widehat{\beta} - \beta^*\|_2 \le c\sqrt{\frac{k \log p}{n}}.$$

Provided $n > Cr^2 \cdot k \log p$, the point $\widehat{\beta}$ will lie in the interior of the sphere of radius $r$ around $\beta^*$. Hence, $\widehat{\beta}$ is also a stationary point of the original program (2), guaranteeing the existence of such local stationary points.


B.2 Proof of Theorem 2 

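This argument is an adaptation of the proofs of Theorems 1 and 2 in the recent paper [36]. We follow the primal-dual witness construction introduced there:

(i) Optimize the restricted program

$$\widehat{\beta}_S \in \arg\min_{\beta \in \mathbb{R}^S :\ \|\beta\|_1 \le R} \big\{\mathcal{L}_n(\beta) + \rho_\lambda(\beta)\big\}, \qquad (42)$$

and establish that $\|\widehat{\beta}_S\|_1 < R$.

(ii) Define $\widehat{z}_S \in \partial\|\widehat{\beta}_S\|_1$, and choose $\widehat{z}_{S^c}$ to satisfy the zero-subgradient condition

$$\nabla \mathcal{L}_n(\widehat{\beta}) - \nabla q_\lambda(\widehat{\beta}) + \lambda\widehat{z} = 0, \qquad (43)$$

where $\widehat{z} = (\widehat{z}_S, \widehat{z}_{S^c})$ and $\widehat{\beta} := (\widehat{\beta}_S, 0_{S^c})$. Show that $\widehat{\beta}_S = \widehat{\beta}^{O}_S$ and establish strict dual feasibility: $\|\widehat{z}_{S^c}\|_\infty < 1$.

(iii) Verify via second-order conditions that $\widehat{\beta}$ is a local minimum of the program (2) and conclude that all stationary points $\widetilde{\beta}$ satisfying $\|\widetilde{\beta} - \beta^*\|_2 \le r$ are supported on $S$.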







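Step (i): By Theorem 1 applied to the restricted program (42), we have

$$\|\widehat{\beta}_S - \beta^*_S\|_1 \le \frac{80\lambda k}{2\alpha - \mu},$$

so

$$\|\widehat{\beta}_S\|_1 \le \|\beta^*_S\|_1 + \|\widehat{\beta}_S - \beta^*_S\|_1 \le \frac{R}{2} + \frac{80\lambda k}{2\alpha - \mu} < R,$$

using the assumptions of the theorem. This establishes step (i) of the PDW construction.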


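Step (ii): Since $\widehat{\beta}_S$ is an interior point of the restricted program (42), it must satisfy a zero-subgradient condition on the restricted program, implying that we may define $\widehat{z}_{S^c}$ to satisfy equation (43). We rewrite the zero-subgradient condition (43) as

$$\big(\nabla \mathcal{L}_n(\widehat{\beta}) - \nabla \mathcal{L}_n(\beta^*)\big) + \big(\nabla \mathcal{L}_n(\beta^*) - \nabla q_\lambda(\widehat{\beta})\big) + \lambda\widehat{z} = 0,$$

and by the fundamental theorem of calculus,

$$\widehat{Q}(\widehat{\beta} - \beta^*) + \big(\nabla \mathcal{L}_n(\beta^*) - \nabla q_\lambda(\widehat{\beta})\big) + \lambda\widehat{z} = 0,$$

where $\widehat{Q} := \int_0^1 \nabla^2 \mathcal{L}_n\big(\beta^* + t(\widehat{\beta} - \beta^*)\big)\, dt$. In block form, this means

$$\begin{bmatrix} \widehat{Q}_{SS} & \widehat{Q}_{SS^c} \\ \widehat{Q}_{S^cS} & \widehat{Q}_{S^cS^c} \end{bmatrix} \begin{bmatrix} \widehat{\beta}_S - \beta^*_S \\ 0 \end{bmatrix} + \begin{bmatrix} \nabla \mathcal{L}_n(\beta^*)_S - \nabla q_\lambda(\widehat{\beta}_S) \\ \nabla \mathcal{L}_n(\beta^*)_{S^c} - \nabla q_\lambda(\widehat{\beta}_{S^c}) \end{bmatrix} + \lambda \begin{bmatrix} \widehat{z}_S \\ \widehat{z}_{S^c} \end{bmatrix} = 0. \qquad (44)$$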

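We now have the following lemma, concerning the oracle estimator:

Lemma 3. Under the conditions of Theorem 2, we have the bound

$$\|\widehat{\beta}^{O}_S - \beta^*_S\|_\infty \le c\sqrt{\frac{\log k}{n}},$$

and $\widehat{\beta}_S = \widehat{\beta}^{O}_S$.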

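Proof. By the optimality of the oracle estimator, we have

$$\mathcal{L}_n(\widehat{\beta}^{O}) \le \mathcal{L}_n(\beta^*). \qquad (45)$$

Furthermore, $\mathcal{L}_n$ is strongly convex over the restricted region $S_r$ by Lemma 1. Hence,

$$\mathcal{L}_n(\beta^*) + \langle \nabla \mathcal{L}_n(\beta^*),\ \widehat{\beta}^{O} - \beta^* \rangle + \frac{\alpha}{4}\|\widehat{\beta}^{O} - \beta^*\|_2^2 \le \mathcal{L}_n(\widehat{\beta}^{O}). \qquad (46)$$

Summing inequalities (45) and (46), we obtain

$$\frac{\alpha}{4}\|\widehat{\beta}^{O} - \beta^*\|_2^2 \le \langle \nabla \mathcal{L}_n(\beta^*),\ \beta^* - \widehat{\beta}^{O} \rangle \le \|\nabla \mathcal{L}_n(\beta^*)\|_\infty \cdot \|\widehat{\beta}^{O} - \beta^*\|_1 \le \sqrt{k}\,\|\nabla \mathcal{L}_n(\beta^*)\|_\infty \cdot \|\widehat{\beta}^{O} - \beta^*\|_2,$$

implying that

$$\|\widehat{\beta}^{O} - \beta^*\|_2 \le \frac{4\sqrt{k}}{\alpha}\|\nabla \mathcal{L}_n(\beta^*)\|_\infty.$$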
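In particular, when

$$\frac{4\sqrt{k}}{\alpha}\|\nabla \mathcal{L}_n(\beta^*)\|_\infty < r,$$

the oracle estimator $\widehat{\beta}^{O}$ is an interior point of the feasible region, implying in particular that

$$\big(\nabla \mathcal{L}_n(\widehat{\beta}^{O})\big)_S = 0.$$

Hence, we have

$$\big(\nabla \mathcal{L}_n(\widehat{\beta}^{O}) - \nabla \mathcal{L}_n(\beta^*)\big)_S + \big(\nabla \mathcal{L}_n(\beta^*)\big)_S = 0,$$

or

$$\widehat{Q}^{O}_{SS}(\widehat{\beta}^{O} - \beta^*)_S + \big(\nabla \mathcal{L}_n(\beta^*)\big)_S = 0,$$

where

$$\widehat{Q}^{O} := \int_0^1 \nabla^2 \mathcal{L}_n\big(\beta^* + t(\widehat{\beta}^{O} - \beta^*)\big)\, dt = \frac{1}{n}\sum_{i=1}^n \Big(\int_0^1 \ell''\big(x_i^T(\beta^* + t(\widehat{\beta}^{O} - \beta^*)) - y_i\big)\, dt\Big) \cdot x_i x_i^T.$$

This implies that

$$\|\widehat{\beta}^{O}_S - \beta^*_S\|_\infty = \Big\|\big(\widehat{Q}^{O}_{SS}\big)^{-1}\big(\nabla \mathcal{L}_n(\beta^*)\big)_S\Big\|_\infty. \qquad (47)$$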

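Let $\mathbb{B}_2^S(1) := \{u : \|u\|_2 \le 1 \text{ and } \mathrm{supp}(u) \subseteq S\}$. For $v, w \in \mathbb{B}_2^S(1)$, consider the quantity

$$\begin{aligned}
\Big| v^T\big[\widehat{Q}^{O}_{SS} - (\nabla^2 \mathcal{L}_n(\beta^*))_{SS}\big] w \Big|
&= \Big| \frac{1}{n}\sum_{i=1}^n \Big(\int_0^1 \Big(\ell''\big(x_i^T(\beta^* + t(\widehat{\beta}^{O} - \beta^*)) - y_i\big) - \ell''\big(x_i^T\beta^* - y_i\big)\Big)\, dt\Big)(x_i^T v)(x_i^T w) \Big| \\
&\le \kappa_3 \cdot \frac{1}{n}\sum_{i=1}^n \Big(\int_0^1 t\,\big|x_i^T(\widehat{\beta}^{O} - \beta^*)\big|\, dt\Big) \cdot |x_i^T v| \cdot |x_i^T w| \\
&\le \kappa_3 \cdot \frac{1}{n}\sum_{i=1}^n \big|x_i^T(\widehat{\beta}^{O} - \beta^*)\big| \cdot |x_i^T v| \cdot |x_i^T w| \\
&\le \kappa_3 \|\widehat{\beta}^{O} - \beta^*\|_2 \cdot \sup_{\|u\|_2 = 1,\ \mathrm{supp}(u) \subseteq S} \Big\{\frac{1}{n}\sum_{i=1}^n |x_i^T u| \cdot |x_i^T v| \cdot |x_i^T w|\Big\},
\end{aligned}$$

and denote $f(u, v, w) := \frac{1}{n}\sum_{i=1}^n |x_i^T u| \cdot |x_i^T v| \cdot |x_i^T w|$. Then

$$\Big|\Big|\Big| \widehat{Q}^{O}_{SS} - (\nabla^2 \mathcal{L}_n(\beta^*))_{SS} \Big|\Big|\Big|_2 \le \kappa_3 \|\widehat{\beta}^{O} - \beta^*\|_2 \cdot \sup_{u, v, w \in \mathbb{B}_2^S(1)} f(u, v, w). \qquad (48)$$

We now use a covering argument. Let $\mathcal{M}$ denote a $\frac{1}{4}$-cover of $\mathbb{B}_2^S(1)$. By standard results on metric entropy, we may choose $\mathcal{M}$ such that $|\mathcal{M}| \le c^k$. For all triples $u, v, w \in \mathbb{B}_2^S(1)$, we may find $u', v', w' \in \mathcal{M}$ such that

$$\|u - u'\|_2,\ \|v - v'\|_2,\ \|w - w'\|_2 \le \frac{1}{4}.$$

Furthermore,

$$|f(u,v,w) - f(u',v',w')| \le |f(u,v,w) - f(u',v,w)| + |f(u',v,w) - f(u',v',w)| + |f(u',v',w) - f(u',v',w')|. \qquad (49)$$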
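Note that

$$\begin{aligned}
|f(u,v,w) - f(u',v,w)| &= \Big| \frac{1}{n}\sum_{i=1}^n \big(|x_i^T u| - |x_i^T u'|\big) \cdot |x_i^T v| \cdot |x_i^T w| \Big| \\
&\le \frac{1}{n}\sum_{i=1}^n \Big| |x_i^T u| - |x_i^T u'| \Big| \cdot |x_i^T v| \cdot |x_i^T w| \\
&\le \frac{1}{n}\sum_{i=1}^n |x_i^T(u - u')| \cdot |x_i^T v| \cdot |x_i^T w| \\
&\le \|u - u'\|_2 \cdot \sup_{u,v,w \in \mathbb{B}_2^S(1)} f(u,v,w) \\
&\le \frac{1}{4} \sup_{u,v,w \in \mathbb{B}_2^S(1)} f(u,v,w),
\end{aligned}$$

and we may bound the other two terms in the expansion (49) analogously. Hence,

$$\sup_{u,v,w \in \mathbb{B}_2^S(1)} f(u,v,w) \le \max_{u',v',w' \in \mathcal{M}} f(u',v',w') + \frac{3}{4} \cdot \sup_{u,v,w \in \mathbb{B}_2^S(1)} f(u,v,w),$$

implying that

$$\sup_{u,v,w \in \mathbb{B}_2^S(1)} f(u,v,w) \le 4 \cdot \max_{u,v,w \in \mathcal{M}} f(u,v,w), \qquad (50)$$

and it remains to bound the right-hand side of inequality (50). For a fixed triple $u, v, w \in \mathcal{M}$, we apply the arithmetic mean-geometric mean inequality, to obtain

$$|x_i^T u| \cdot |x_i^T v| \cdot |x_i^T w| \le \frac{1}{3}\big(|x_i^T u|^3 + |x_i^T v|^3 + |x_i^T w|^3\big).$$

Then

$$f(u,v,w) \le \frac{1}{3} \cdot \frac{1}{n}\sum_{i=1}^n \big(|x_i^T u|^3 + |x_i^T v|^3 + |x_i^T w|^3\big).$$

Note that

$$E[f(u,v,w)] \le \frac{1}{3}\Big(E\big[|x_i^T u|^3\big] + E\big[|x_i^T v|^3\big] + E\big[|x_i^T w|^3\big]\Big) \le c\sigma_x^3,$$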


using the sub-Gaussian assumption on the $x_i$'s. Finally, we invoke a concentration bound on $f(u,v,w)$. We use a result from Adamczak and Wolff [1]. Theorem 1.4 of that paper gives a concentration result for i.i.d. averages of polynomials of sub-Gaussian variables, implying in particular that


P(| f(u,v,w) 


E [f(u, v, u;)] | > t) < ci exp 


^rniri 


nt 2 (nt) 2 / 3 | \ 

S) 


Vf > 0. 


Setting t 


ca ; 


and taking a union bound over all $u, v, w \in \mathcal{M}$, we then conclude from inequality (50) that

$$\sup_{u,v,w \in \mathbb{B}_2^S(1)} f(u,v,w) \le c\sigma_x^3$$
with probability at least $1 - c_1 \exp(-c_2' k)$. Plugging back into inequality (48) then gives


Qss ~ (V 2 £n(/?*)) ss < CK 3 al\\P° - p*\\ 2 ■ < cn 3 alrd \ (51) 

A \ IL \ f L 

We further note that

$$v^T\big\{(\nabla^2 \mathcal{L}_n(\beta^*))_{SS}\big\}w = \frac{1}{n}\sum_{i=1}^n \ell''(x_i^T\beta^* - y_i) \cdot (x_i^T v) \cdot (x_i^T w),$$

which is an i.i.d. average of products of sub-Gaussians (since $\ell''$ is bounded), so an even easier covering argument establishes concentration to $v^T\big\{(\nabla^2 \mathcal{L}(\beta^*))_{SS}\big\}w$. Hence,


Q? s -(V 2 £(r)) 5S 

Combining inequalities (51) and (52) gives 

Qg 5 -(V 2 £(/3*)) 5S 




(52) 


<c" 


Then by a simple matrix inversion relation (cf. Lemma 12 in Loh and Wainwright [36]), we 
have 


(Qss)- 1 - {v 2 c(p*))sl 


< c 


as well. Returning to equation (47), we see that 

Wffi-PsWoo 

{(Qss)- 1 - (V 2 £(/T))^} (V£ n (/3*)) s 


< 


+ 


(V 2 £(r)) S 5(V£ n (/3*)) s 


< • Vk ||(V£„.(r ))slloo + || (V 2 /:(r )) 5 5 (V£ n (/8*)) 5 


< c 


log k 


n 


assuming n > k 2 . This is the desired result. 


□ 


In particular, Lemma 3 implies when /3^ in > + 7 A, we have 


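$$\nabla q_\lambda(\widehat{\beta}_S) = \lambda\,\mathrm{sign}(\widehat{\beta}_S) = \lambda\widehat{z}_S.$$

Furthermore, the selection property implies $\nabla q_\lambda(\widehat{\beta}_{S^c}) = 0$. Plugging these results into equation (44) and performing some algebra, we conclude that

$$\widehat{z}_{S^c} = \frac{1}{\lambda}\Big\{\widehat{Q}_{S^cS}\big(\widehat{Q}_{SS}\big)^{-1}\big(\nabla \mathcal{L}_n(\beta^*)\big)_S - \big(\nabla \mathcal{L}_n(\beta^*)\big)_{S^c}\Big\}, \qquad (53)$$

so

$$\|\widehat{z}_{S^c}\|_\infty \le \frac{1}{\lambda}\Big\|\widehat{Q}_{S^cS}\big(\widehat{Q}_{SS}\big)^{-1}\big(\nabla \mathcal{L}_n(\beta^*)\big)_S\Big\|_\infty + \frac{1}{\lambda}\Big\|\big(\nabla \mathcal{L}_n(\beta^*)\big)_{S^c}\Big\|_\infty. \qquad (54)$$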
We now use similar arguments to those employed in the proof of Lemma 3 to control the
terms in inequality (54). Note that ||V£ n (/3 *)|| 00 < by assumption, so we can focus 

on the first term. We have 

(Q — V 2 £ ri (/3*))u;| 

n -i 


1 P1 

-E/ if' (xfiP + t (P - P*)) ~Vi) ~Vi)) dt '( x I V )( x i 

1 _ n rl 

■■3-T, 
n Jo 


w 


t ■ \xf (J3 — (3 *)\) dt ■ \xfv\ ■ \xfw\ 


I _L % "A rr-t rji rj-i 

< «3 j3 — j3 || 2 • sup < — > \Xa u\ ■ \x i u| • Xj w\ 
ueBf(i) 1 ” 


i =1 


< K^r • sup 
ueBf(i) 


1 


n 


E l T I I T I I T I 
|Xj U I • \X i V\ ■ \X i w\ 


1=1 


By essentially the same bounding and covering argument as before, we conclude that 

He 


Qss - (V 2 £(/?*)) ss ||| 2 < 


and 


, -l 


< C '\A 


(Qssr'-iv^nyss 

with probability at least 1 — c\ exp(— 02 k). Furthermore, we may show that 

Tin (^72 rto*W I / ../// /fc + log p 


max 

jeS c 


{&. s -(V 2 £(/3*)) s „ s }|| 2 < c '" v / 


n 


(55) 


(56) 


with probability at least 1 — p exp(—c 2 min{fc, logp}) by a similar argument, this time taking 
a union bound over j € S c rather than all unit vectors for one of the coordinates in the 
covering. Defining 


Si := Qsos ~ (V 2 £(/3*)) S c S , and S 2 := (Qss )" 1 - (V 2 £(/3*)) S s , 

we may conclude that 

QscsiQss)- 1 (V£„(r))s| ^ < \\SiS 2 (V£ n (r))sll 0O + <51 (V 2 £(/3*))^ (V£,„(/T )) 5 

+ || (V 2 £({3*)) ScS S 2 (V£„03*))sIL 

< max ||ej<5i.|| 2 • l^Ia ■ Vk\\VC n (^) IU + max ||ej«5i || 2 • Vk (V 2 £(/T))“* (V£ n (/T))< 
j£S c jes c 


< c 


+ III (V 2 £(/3*)) 5C J 2 • III^2III 2 • y/k ||V£ n (r)|| c 
log p 

5 

n 


with probability at least $1 - c_1 \exp(-c_2 \min\{k, \log p\})$, assuming the scaling $n \ge \max\{k^2, k \log p\}$ and using the inequalities (55) and (56) above. In particular, for $\lambda \ge C\sqrt{\frac{\log p}{n}}$, we conclude at last that the strict dual feasibility condition $\|\widehat{z}_{S^c}\|_\infty < 1$ holds, completing step (ii) of the PDW construction.
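Step (iii): Finally, we establish that $\widehat{\beta} = (\widehat{\beta}^{O}_S, 0_{S^c})$ is a local minimum of the full program (2) and in fact, all stationary points of the program must take this form. A classical result by Fletcher and Watson [18] gives sufficient conditions for a point to be a local minimum of a norm-regularized program. Rather than repeating the details here, we refer the reader to the argument provided in the proof of Theorem 1 in Loh and Wainwright [36], which may be applied verbatim to establish that $\widehat{\beta}$ is a local minimum. Now suppose $\widetilde{\beta}$ is a stationary point of the program (2) satisfying $\|\widetilde{\beta} - \beta^*\|_2 \le r$. By the RSC condition (13) applied to the pair $(\widetilde{\beta}, \widehat{\beta})$, we have

$$\alpha\|\widetilde{\beta} - \widehat{\beta}\|_2^2 - \tau\frac{\log p}{n}\|\widetilde{\beta} - \widehat{\beta}\|_1^2 \le \langle \nabla \mathcal{L}_n(\widetilde{\beta}) - \nabla \mathcal{L}_n(\widehat{\beta}),\ \widetilde{\beta} - \widehat{\beta} \rangle. \qquad (57)$$

By the convexity of $\frac{\mu}{2}\|\beta\|_2^2 - q_\lambda(\beta)$, we also have

$$\langle \nabla q_\lambda(\widetilde{\beta}) - \nabla q_\lambda(\widehat{\beta}),\ \widetilde{\beta} - \widehat{\beta} \rangle \le \mu\|\widetilde{\beta} - \widehat{\beta}\|_2^2. \qquad (58)$$

Finally, the first-order optimality condition applied to $\widetilde{\beta}$ gives

$$0 \le \langle \nabla \mathcal{L}_n(\widetilde{\beta}) - \nabla q_\lambda(\widetilde{\beta}),\ \widehat{\beta} - \widetilde{\beta} \rangle + \lambda \cdot \langle \widetilde{z},\ \widehat{\beta} - \widetilde{\beta} \rangle, \qquad (59)$$

where $\widetilde{z} \in \partial\|\widetilde{\beta}\|_1$. Summing the inequalities (57), (58), and (59), we obtain

$$(\alpha - \mu)\|\widetilde{\beta} - \widehat{\beta}\|_2^2 - \tau\frac{\log p}{n}\|\widetilde{\beta} - \widehat{\beta}\|_1^2 \le \langle \nabla q_\lambda(\widehat{\beta}) - \nabla \mathcal{L}_n(\widehat{\beta}),\ \widetilde{\beta} - \widehat{\beta} \rangle + \lambda \cdot \langle \widetilde{z},\ \widehat{\beta} - \widetilde{\beta} \rangle. \qquad (60)$$

Recall that since $\widehat{\beta}$ is an interior point, we have the zero-subgradient condition

$$\nabla \mathcal{L}_n(\widehat{\beta}) - \nabla q_\lambda(\widehat{\beta}) + \lambda\widehat{z} = 0.$$

Combining this with inequality (60), we obtain

$$\begin{aligned}
(\alpha - \mu)\|\widetilde{\beta} - \widehat{\beta}\|_2^2 - \tau\frac{\log p}{n}\|\widetilde{\beta} - \widehat{\beta}\|_1^2
&\le \lambda \cdot \langle \widehat{z},\ \widetilde{\beta} - \widehat{\beta} \rangle + \lambda \cdot \langle \widetilde{z},\ \widehat{\beta} - \widetilde{\beta} \rangle \\
&= \lambda \cdot \langle \widehat{z},\ \widetilde{\beta} \rangle - \lambda\|\widehat{\beta}\|_1 + \lambda \cdot \langle \widetilde{z},\ \widehat{\beta} \rangle - \lambda\|\widetilde{\beta}\|_1 \\
&\le \lambda \cdot \langle \widehat{z},\ \widetilde{\beta} \rangle - \lambda\|\widetilde{\beta}\|_1. \qquad (61)
\end{aligned}$$

We now show the following lemma: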

Lemma 4. Suppose 5 > 0 is such that ||Js , c || 00 <1 — 5. Then for X > we have 

\\P-Ph< 0 + 2 ) Vk\\p-P\\ 2 . 

Proof. This is identical to the proof of Lemma 7 in Loh and Wainwright [36]. □ 

Using Lemma 4 to bound the left-hand side of inequality (61), we then obtain 

M - r0 + 2 ) ^ || P ~P\\l<X- (z, P) - X\\P\\i, 

so if n > (| + 2)“ klogp, this implies 

0<\-{zJ)-\\\P\\i- 

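At the same time, Hölder's inequality gives

$$\lambda \cdot \langle \widehat{z},\ \widetilde{\beta} \rangle - \lambda\|\widetilde{\beta}\|_1 \le \lambda \cdot \|\widehat{z}\|_\infty\|\widetilde{\beta}\|_1 - \lambda\|\widetilde{\beta}\|_1 \le \lambda\|\widetilde{\beta}\|_1 - \lambda\|\widetilde{\beta}\|_1 = 0.$$

Hence, we must have $\langle \widehat{z},\ \widetilde{\beta} \rangle = \|\widetilde{\beta}\|_1$. Since $\|\widehat{z}_{S^c}\|_\infty < 1$ by assumption, this means that $\mathrm{supp}(\widetilde{\beta}) \subseteq S$, as wanted.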








B.3 Proof of Theorem 3 

We derive the following variants of Lemmas 1 and 2 in Loh and Wainwright [35]; the remainder of the argument is exactly the same as in that paper, so we do not repeat it here. The reason why we need to revise the two lemmas is that the proofs in Loh and Wainwright [35] require the statement of the RSC condition in that paper, which also provides control on the behavior of $\mathcal{L}_n$ outside the local region.

Lemma 5. Under the conditions of the theorem, we have 

WP -Ph< T y Vf>0. 

Proof. We induct on the iteration number t. Note that the base case, t = 0, holds by 
assumption. Suppose t > 0 is such that ||/3* — /31|2 < §; we will show that \\/3 t+1 — (5 1 |2 < §, 
as well. 

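By the RSC condition (24), we have

$$\alpha'\|\beta^t - \widehat{\beta}\|_2^2 - \tau'\frac{\log p}{n}\|\beta^t - \widehat{\beta}\|_1^2 \le \mathcal{L}_n(\widehat{\beta}) - \mathcal{L}_n(\beta^t) - \langle \nabla \mathcal{L}_n(\beta^t),\ \widehat{\beta} - \beta^t \rangle. \qquad (62)$$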

Furthermore, since ^||/3||2 — Q\(P) is convex by the /x-amenability of p\, combining inequal¬ 
ity ( 68 ) with (/3i, /? 2 ) = {P t ,P) and inequality (62) and the inequality 

\\P t+ %+ ( S ign(p t + 1 ) J-/3 t+1 ) < 11 ^! 


implies that 

CniM) + (VA,(/3*) - V^), p- P) - qxi?) + A||/3 i+1 || 1 + A(sign(/3* +1 ), p - p t+1 ) 

+ (a' - |) || p t - Ml - t'^II/ 3* - jS||? < C n (P) - q\(P) + Ap||r, 

so 

End?) + (V£„(/3 4 ), M - M) + X\\p t+1 111 + A(sign(/3 f+1 ), p - p t+l ) - t'^||/3* - Ml 

n 

<l n 0) + M\Mi- (63) 

By the RSM condition (25), we have 

Cn(P t+1 ) - C n (/3*) - (V£ n (/?*), p t+l - pt) < a"\\p t+1 - pt\\l + r J -^\\p t+1 - 

n 

and combined with the convexity of q\ , we have 

Cn(P t+1 ) ~ C n (pt) - (VCrfpt), p t+1 - pt) < a"\\p t+1 - /3*||| + T"^||/3 m - pt II?. (64) 

n 

Combining inequalities (63) and (64) then gives 

{C n (P t+1 ) + X\\P t+ 1 \\l) ~ (C n (P) + X\\Ml) 

< (VCn(pt), p t+1 -p) + A(sign(/3 m ), P t+1 -P) + Oi"\\p t+l - ^||| + 4 R 2 (t' + r") —, 

n 

(65) 
using the fact that $\|\beta^{t+1} - \beta^t\|_1,\ \|\beta^t - \widehat{\beta}\|_1 \le 2R$, by feasibility of each point. Note that the left-hand side of inequality (65) is lower-bounded by 0, since $\widehat{\beta}$ is a global optimum. Finally, note that from the first-order optimality condition on equation (22), we have


(V£„(/3*) + p(/3 m - /3*) + Asign(/3 m ), /3 m - 0) < 0. 
Combining inequality ( 66 ) with inequality (65) then gives 

0 < (£„(/3 m ) + A||/3 t+1 || 1 ) - (Z n 0) + AH^II^ 


( 66 ) 


< ot 


"II fl*+l _ fl*l|2 


-n + {r' + T ") 


4 R 2 log p 


n 


= [a “2 


IA llflt+l _ flt ||2 _ 1\\Rt+l 




- ,{^‘ +1 - ff , / S ,+1 - 0 ) 


-?ll2 + ?r-?llI + (r' + T") 


4i ? 2 log p 


< 


Vnot 


|2 _ R II flt +1 




2 

4i ? 2 log p 


n 


n 


(67) 


using the assumption that r] > 2a”. Hence, 


Dt+l 


\i< 


2 , 8 (r' + r") R 2 logp 


+ 


n 


Using the inductive hypothesis and the assumption that n > ——^-logp, we then have 


Dt+l 


I 2 - x + T ~ r ' 


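In particular, we may apply the RSC condition (24) to the pair $(\beta^{t+1}, \widehat{\beta})$ to obtain

$$\alpha' \|\beta^{t+1} - \widehat{\beta}\|_2^2 - \tau' \frac{\log p}{n} \|\beta^{t+1} - \widehat{\beta}\|_1^2 \le \mathcal{L}_n(\beta^{t+1}) - \mathcal{L}_n(\widehat{\beta}) - \langle \nabla \mathcal{L}_n(\widehat{\beta}), \, \beta^{t+1} - \widehat{\beta} \rangle.$$

By the convexity of $\frac{\mu}{2}\|\beta\|_2^2 - q_\lambda(\beta)$, we have

$$\langle \nabla q_\lambda(\widehat{\beta}), \, \beta^{t+1} - \widehat{\beta} \rangle \ge q_\lambda(\beta^{t+1}) - q_\lambda(\widehat{\beta}) - \frac{\mu}{2} \|\widehat{\beta} - \beta^{t+1}\|_2^2. \tag{68}$$

Together with the inequality

$$\|\widehat{\beta}\|_1 + \langle \mathrm{sign}(\widehat{\beta}), \, \beta^{t+1} - \widehat{\beta} \rangle \le \|\beta^{t+1}\|_1,$$

we then have

$$\begin{aligned}
\left(\alpha' - \frac{\mu}{2}\right) \|\beta^{t+1} - \widehat{\beta}\|_2^2 &- \tau' \frac{\log p}{n} \|\beta^{t+1} - \widehat{\beta}\|_1^2 \\
&\le \left(\bar{\mathcal{L}}_n(\beta^{t+1}) + \lambda \|\beta^{t+1}\|_1\right) - \left(\bar{\mathcal{L}}_n(\widehat{\beta}) + \lambda \|\widehat{\beta}\|_1\right) - \langle \nabla \bar{\mathcal{L}}_n(\widehat{\beta}) + \lambda \, \mathrm{sign}(\widehat{\beta}), \, \beta^{t+1} - \widehat{\beta} \rangle.
\end{aligned} \tag{69}$$

Finally, the first-order optimality condition on $\widehat{\beta}$ gives

$$\langle \nabla \bar{\mathcal{L}}_n(\widehat{\beta}) + \lambda \, \mathrm{sign}(\widehat{\beta}), \, \beta^{t+1} - \widehat{\beta} \rangle \ge 0.$$

Combined with inequality (69), we conclude that

$$\left(\alpha' - \frac{\mu}{2}\right) \|\beta^{t+1} - \widehat{\beta}\|_2^2 - \tau' \frac{\log p}{n} \|\beta^{t+1} - \widehat{\beta}\|_1^2 \le \left(\bar{\mathcal{L}}_n(\beta^{t+1}) + \lambda \|\beta^{t+1}\|_1\right) - \left(\bar{\mathcal{L}}_n(\widehat{\beta}) + \lambda \|\widehat{\beta}\|_1\right). \tag{70}$$

Inequality (67) gives an upper bound on the right-hand side of inequality (70). Combining
the two inequalities, we then have

$$\left(\alpha' - \frac{\mu}{2}\right) \|\beta^{t+1} - \widehat{\beta}\|_2^2 - \tau' \frac{\log p}{n} \|\beta^{t+1} - \widehat{\beta}\|_1^2 \le \frac{\eta}{2} \|\beta^t - \widehat{\beta}\|_2^2 - \frac{\eta}{2} \|\beta^{t+1} - \widehat{\beta}\|_2^2 + (\tau' + \tau'') \frac{4R^2 \log p}{n}.$$

Hence,

$$\begin{aligned}
\|\beta^{t+1} - \widehat{\beta}\|_2^2 &\le \frac{\eta/2}{\alpha' - \mu/2 + \eta/2} \|\beta^t - \widehat{\beta}\|_2^2 + \frac{1}{\alpha' - \mu/2 + \eta/2} \left( (\tau' + \tau'') \frac{4R^2 \log p}{n} + \tau' \frac{\log p}{n} \|\beta^{t+1} - \widehat{\beta}\|_1^2 \right) \\
&\le \frac{\eta/2}{\alpha' - \mu/2 + \eta/2} \|\beta^t - \widehat{\beta}\|_2^2 + \frac{4 (2\tau' + \tau'') R^2 \log p}{(\alpha' - \mu/2 + \eta/2) \, n}.
\end{aligned}$$

Using the inductive hypothesis one more time and the scaling assumption (26), we conclude
that

$$\|\beta^{t+1} - \widehat{\beta}\|_2^2 \le \frac{\eta/2}{\alpha' - \mu/2 + \eta/2} \cdot \frac{r^2}{4} + \frac{4 (2\tau' + \tau'') R^2 \log p}{(\alpha' - \mu/2 + \eta/2) \, n} \le \frac{r^2}{4},$$

completing the induction. □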
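
The two elementary facts driving the induction can be checked numerically. The sketch below is illustrative only: `kappa` stands in for the contraction factor $(\eta/2)/(\alpha' - \mu/2 + \eta/2)$ and `c` for the additive $R^2 \log p / n$ term, and both function names are hypothetical helpers rather than anything defined in the paper. It verifies the three-point identity used to pass from the second to the third line of (67), and that a recursion $d_{t+1} = \kappa d_t + c$ started at $r^2/4$ stays below $r^2/4$ whenever $c \le (1 - \kappa) r^2 / 4$.

```python
import numpy as np


def three_point_residual(a, b, c):
    """Return 2<a-b, a-c> - (|a-b|^2 + |a-c|^2 - |b-c|^2); this is 0 in exact arithmetic."""
    lhs = 2.0 * np.dot(a - b, a - c)
    rhs = np.sum((a - b) ** 2) + np.sum((a - c) ** 2) - np.sum((b - c) ** 2)
    return lhs - rhs


def iterate_recursion(d0, kappa, c, steps):
    """Iterate d_{t+1} = kappa * d_t + c and return the list [d_0, ..., d_steps]."""
    values = [d0]
    for _ in range(steps):
        values.append(kappa * values[-1] + c)
    return values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b, c_vec = rng.standard_normal((3, 5))
    print("three-point residual:", three_point_residual(a, b, c_vec))

    r, kappa = 1.0, 0.6                       # assumed values, for illustration
    c = 0.9 * (1.0 - kappa) * r ** 2 / 4.0    # additive term small enough for the induction
    print("max of iterates:", max(iterate_recursion(r ** 2 / 4.0, kappa, c, 50)))
```

The second check mirrors the role of the scaling assumption (26): it is exactly the requirement that the additive term be small relative to $(1 - \kappa) r^2 / 4$.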


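Lemma 6. Under the conditions of the theorem, suppose there exists a pair $(\bar{\eta}, T)$ such that

$$\phi(\beta^t) - \phi(\widehat{\beta}) \le \bar{\eta}, \qquad \forall t \ge T.$$

Then for any iteration $t \ge T$, we have

$$\|\beta^t - \widehat{\beta}\|_1 \le 8\sqrt{k} \, \|\beta^t - \widehat{\beta}\|_2 + 16\sqrt{k} \, \|\widehat{\beta} - \beta^*\|_2 + 2 \cdot \min\left\{ \frac{2\bar{\eta}}{\lambda}, \, R \right\}.$$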

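Proof. This proof is in fact a simplification of the argument used to prove Lemma 1 in Loh
and Wainwright [35], since by Lemma 5 and the assumption, we are guaranteed that

$$\|\beta^t - \beta^*\|_2 \le \|\widehat{\beta} - \beta^t\|_2 + \|\widehat{\beta} - \beta^*\|_2 \le \frac{r}{2} + \frac{r}{2} = r,$$

so we may apply the RSC condition (24) directly. Denoting $\Delta := \beta^t - \beta^*$, we then have

$$\begin{aligned}
\alpha' \|\Delta\|_2^2 - \tau' \frac{\log p}{n} \|\Delta\|_1^2 &\le \mathcal{L}_n(\beta^t) - \mathcal{L}_n(\beta^*) - \langle \nabla \mathcal{L}_n(\beta^*), \, \Delta \rangle \\
&\le \mathcal{L}_n(\beta^t) - \mathcal{L}_n(\beta^*) + \|\nabla \mathcal{L}_n(\beta^*)\|_\infty \cdot \|\Delta\|_1 \\
&\le \mathcal{L}_n(\beta^t) - \mathcal{L}_n(\beta^*) + \frac{\lambda}{8} \|\Delta\|_1.
\end{aligned} \tag{71}$$

Furthermore, by assumption, we have

$$\mathcal{L}_n(\beta^t) - \mathcal{L}_n(\beta^*) + \rho_\lambda(\beta^t) - \rho_\lambda(\beta^*) \le \bar{\eta}, \tag{72}$$

which combined with inequality (71) implies that

$$\alpha' \|\Delta\|_2^2 - \tau' \frac{\log p}{n} \|\Delta\|_1^2 \le \rho_\lambda(\beta^*) - \rho_\lambda(\beta^t) + \bar{\eta} + \frac{\lambda}{8} \|\Delta\|_1. \tag{73}$$

Note that if $\bar{\eta} \ge \frac{\lambda}{4} \|\Delta\|_1$, the desired inequality is trivial. Hence, we assume that $\bar{\eta} \le \frac{\lambda}{4} \|\Delta\|_1$.
In particular, inequality (73) implies that

$$\begin{aligned}
\alpha' \|\Delta\|_2^2 &\le \rho_\lambda(\beta^*) - \rho_\lambda(\beta^t) + \tau' \frac{\log p}{n} \|\Delta\|_1^2 + \left( \frac{\lambda}{4} + \frac{\lambda}{8} \right) \|\Delta\|_1 \\
&\le \rho_\lambda(\beta^*) - \rho_\lambda(\beta^t) + 2\tau' R \frac{\log p}{n} \|\Delta\|_1 + \frac{3\lambda}{8} \|\Delta\|_1 \\
&\le \rho_\lambda(\beta^*) - \rho_\lambda(\beta^t) + \frac{\lambda}{2} \|\Delta\|_1 \\
&\le \rho_\lambda(\beta^*) - \rho_\lambda(\beta^t) + \frac{\rho_\lambda(\Delta)}{2} + \frac{\mu}{4} \|\Delta\|_2^2 \\
&\le \rho_\lambda(\beta^*) - \rho_\lambda(\beta^t) + \frac{\rho_\lambda(\beta^*) + \rho_\lambda(\beta^t)}{2} + \frac{\mu}{4} \|\Delta\|_2^2 \\
&= \frac{3}{2} \rho_\lambda(\beta^*) - \frac{1}{2} \rho_\lambda(\beta^t) + \frac{\mu}{4} \|\Delta\|_2^2.
\end{aligned}$$

Hence, we have

$$0 \le \left(\alpha' - \frac{\mu}{2}\right) \|\Delta\|_2^2 \le \frac{3}{2} \rho_\lambda(\beta^*) - \frac{1}{2} \rho_\lambda(\beta^t).$$

By Lemma 5 in Loh and Wainwright [35], we then have

$$\rho_\lambda(\beta^*) - \rho_\lambda(\beta^t) \le 3\rho_\lambda(\beta^*) - \rho_\lambda(\beta^t) \le \lambda \left( 3\|\Delta_A\|_1 - \|\Delta_{A^c}\|_1 \right), \tag{74}$$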


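where $A$ indexes the top $k$ components of $\Delta$ in magnitude. Combining inequalities (73)
and (74), we then have

$$\alpha' \|\Delta\|_2^2 - \tau' \frac{\log p}{n} \|\Delta\|_1^2 \le 3\lambda \|\Delta_A\|_1 - \lambda \|\Delta_{A^c}\|_1 + \bar{\eta} + \frac{\lambda}{8} \|\Delta\|_1,$$

so

$$\begin{aligned}
0 \le \alpha' \|\Delta\|_2^2 &\le 3\lambda \|\Delta_A\|_1 - \lambda \|\Delta_{A^c}\|_1 + \bar{\eta} + \frac{\lambda}{8} \|\Delta\|_1 + 2\tau' R \frac{\log p}{n} \|\Delta\|_1 \\
&\le 3\lambda \|\Delta_A\|_1 - \lambda \|\Delta_{A^c}\|_1 + \bar{\eta} + \frac{\lambda}{4} \|\Delta\|_1 \\
&\le \frac{7\lambda}{2} \|\Delta_A\|_1 - \frac{\lambda}{2} \|\Delta_{A^c}\|_1 + \bar{\eta}.
\end{aligned}$$

Hence,

$$\|\Delta_{A^c}\|_1 \le 7 \|\Delta_A\|_1 + \frac{2\bar{\eta}}{\lambda},$$

so

$$\|\Delta\|_1 \le \|\Delta_A\|_1 + \|\Delta_{A^c}\|_1 \le 8 \|\Delta_A\|_1 + \frac{2\bar{\eta}}{\lambda} \le 8\sqrt{k} \, \|\Delta\|_2 + \frac{2\bar{\eta}}{\lambda}.$$

Also, $\|\Delta\|_1 \le 2R$, so we clearly have

$$\|\Delta\|_1 \le 8\sqrt{k} \, \|\Delta\|_2 + 2 \cdot \min\left\{ \frac{2\bar{\eta}}{\lambda}, \, R \right\}. \tag{75}$$

Further note that by essentially the same argument, with inequality (72) replaced by

$$\mathcal{L}_n(\widehat{\beta}) - \mathcal{L}_n(\beta^*) + \rho_\lambda(\widehat{\beta}) - \rho_\lambda(\beta^*) \le 0,$$

we have the inequality

$$\|\widehat{\beta} - \beta^*\|_1 \le 8\sqrt{k} \, \|\widehat{\beta} - \beta^*\|_2. \tag{76}$$

Combining inequalities (75) and (76) and using the triangle inequality then yields

$$\begin{aligned}
\|\beta^t - \widehat{\beta}\|_1 &\le \|\widehat{\beta} - \beta^*\|_1 + \|\beta^t - \beta^*\|_1 \\
&\le 8\sqrt{k} \left( \|\widehat{\beta} - \beta^*\|_2 + \|\beta^t - \beta^*\|_2 \right) + 2 \cdot \min\left\{ \frac{2\bar{\eta}}{\lambda}, \, R \right\} \\
&\le 8\sqrt{k} \left( \|\widehat{\beta} - \beta^*\|_2 + \|\beta^t - \widehat{\beta}\|_2 + \|\widehat{\beta} - \beta^*\|_2 \right) + 2 \cdot \min\left\{ \frac{2\bar{\eta}}{\lambda}, \, R \right\} \\
&= 8\sqrt{k} \left( \|\beta^t - \widehat{\beta}\|_2 + 2 \|\widehat{\beta} - \beta^*\|_2 \right) + 2 \cdot \min\left\{ \frac{2\bar{\eta}}{\lambda}, \, R \right\},
\end{aligned}$$

completing the proof. □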
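
The step from the cone-type inequality $\|\Delta_{A^c}\|_1 \le 7\|\Delta_A\|_1 + 2\bar{\eta}/\lambda$ to the $\ell_1$/$\ell_2$ comparison can be illustrated with a small numerical check. This is only a sketch: the function name `check_l1_bound` and the test values are hypothetical, and the code simply evaluates both inequalities for a given vector, with $A$ taken as the indices of the $k$ largest entries in magnitude.

```python
import numpy as np


def check_l1_bound(delta, k, eta_bar, lam):
    """Evaluate the cone condition and the resulting l1/l2 bound for one vector.

    Returns (cone_holds, bound_holds), where
      cone_holds : ||delta_{A^c}||_1 <= 7 ||delta_A||_1 + 2 eta_bar / lam
      bound_holds: ||delta||_1 <= 8 sqrt(k) ||delta||_2 + 2 eta_bar / lam
    and A indexes the k largest entries of delta in magnitude.
    """
    delta = np.asarray(delta, dtype=float)
    order = np.argsort(-np.abs(delta))          # indices sorted by decreasing magnitude
    l1_top = np.abs(delta[order[:k]]).sum()     # ||delta_A||_1
    l1_rest = np.abs(delta[order[k:]]).sum()    # ||delta_{A^c}||_1
    slack = 2.0 * eta_bar / lam
    cone_holds = l1_rest <= 7.0 * l1_top + slack
    bound_holds = l1_top + l1_rest <= 8.0 * np.sqrt(k) * np.linalg.norm(delta) + slack + 1e-12
    return bool(cone_holds), bool(bound_holds)


if __name__ == "__main__":
    print(check_l1_bound([3.0, 2.0, 1.0, 0.1, 0.1], k=3, eta_bar=0.5, lam=1.0))
```

Whenever the first flag is true the second must be true as well, since $\|\Delta_A\|_1 \le \sqrt{k}\,\|\Delta_A\|_2 \le \sqrt{k}\,\|\Delta\|_2$.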


C Proofs of propositions in Section 3.2 

In this Appendix, we provide the proofs of the technical propositions establishing sufficient 
conditions for statistical consistency of stationary points in Section 3.2. 


C.1 Proof of Proposition 1

We have

$$\|\nabla \mathcal{L}_n(\beta^*)\|_\infty = \left\| \frac{1}{n} \sum_{i=1}^n w(x_i) x_i \cdot \ell'(\epsilon_i \cdot v(x_i)) \right\|_\infty.$$

Since $x_i \perp\!\!\!\perp \epsilon_i$ by assumption, the tower property of conditional expectation gives

$$E\left[ w(x_i) x_i \cdot \ell'(\epsilon_i \cdot v(x_i)) \right] = E\Big[ E\left[ \ell'(\epsilon_i \cdot v(x_i)) \mid x_i \right] \cdot w(x_i) x_i \Big]. \tag{77}$$

Under condition (2a), the right-hand expression of equation (77) may be written as

$$E\Big[ E\left[ \ell'(\epsilon_i) \mid x_i \right] \cdot w(x_i) x_i \Big] = E\Big[ E[\ell'(\epsilon_i)] \cdot w(x_i) x_i \Big] = E[\ell'(\epsilon_i)] \cdot E\left[ w(x_i) x_i \right] = 0.$$

If instead condition (2b) holds, the right-hand expression of equation (77) is clearly also equal
to 0.

Finally, note that since $\ell'$ is bounded, the variables $\ell'(\epsilon_i \cdot v(x_i))$ are i.i.d. sub-Gaussian with
parameter scaling with $\kappa_1$. By condition (1), the variables $w(x_i) x_i$ are also sub-Gaussian.
Hence, the desired bound holds by using standard concentration results for i.i.d. sums of
products of sub-Gaussian variables.
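
To make the conclusion of this argument concrete, the following sketch simulates $\|\nabla \mathcal{L}_n(\beta^*)\|_\infty$ for one particular setup that is assumed here for illustration and is not prescribed by the proof: the Huber loss (so that $\ell'$ is the identity clipped at a threshold `delta`), weights $w \equiv v \equiv 1$, standard Gaussian covariates, and standard Cauchy errors. The function names are hypothetical. Even though the errors have no finite mean, the clipped score keeps the gradient on the order of $\sqrt{\log p / n}$.

```python
import numpy as np


def huber_psi(u, delta=1.345):
    """Derivative of the Huber loss: the identity clipped to [-delta, delta]."""
    return np.clip(u, -delta, delta)


def grad_sup_norm(X, eps, delta=1.345):
    """Compute || (1/n) sum_i x_i * psi(eps_i) ||_inf, taking w = v = 1."""
    n = X.shape[0]
    return np.abs(X.T @ huber_psi(eps, delta) / n).max()


def simulate(n, p, seed=0):
    """Return (gradient sup-norm, sqrt(log p / n)) for Gaussian X and Cauchy errors."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    eps = rng.standard_cauchy(n)
    return grad_sup_norm(X, eps), np.sqrt(np.log(p) / n)


if __name__ == "__main__":
    g, rate = simulate(n=2000, p=50, seed=1)
    print(f"sup-norm of gradient: {g:.4f}, sqrt(log p / n): {rate:.4f}")
```

Replacing `huber_psi` with the identity (ordinary least squares) removes the boundedness of $\ell'$, which is the ingredient the proof relies on.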

C.2 Proof of Proposition 2 

We begin with the outline of the main argument, with the proofs of supporting lemmas 
provided in subsequent subsections. The same general argument is used in the proofs of 
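Propositions 3 and 4, as well.

C.2.1 Main argument

We have

$$T(\beta_1, \beta_2) := \langle \nabla \mathcal{L}_n(\beta_1) - \nabla \mathcal{L}_n(\beta_2), \, \beta_1 - \beta_2 \rangle = \frac{1}{n} \sum_{i=1}^n \left( \ell'(x_i^T \beta_1 - y_i) - \ell'(x_i^T \beta_2 - y_i) \right) x_i^T (\beta_1 - \beta_2). \tag{78}$$

Under the assumptions, equation (78) implies that

$$T(\beta_1, \beta_2) \ge \frac{1}{n} \sum_{i=1}^n \left( \ell'(x_i^T \beta_1 - y_i) - \ell'(x_i^T \beta_2 - y_i) \right) x_i^T (\beta_1 - \beta_2) \, 1_{A_i} - \kappa_2 \cdot \frac{1}{n} \sum_{i=1}^n \left( x_i^T (\beta_1 - \beta_2) \right)^2 1_{A_i^c}, \tag{79}$$

where we set $\kappa_2 = 0$ in the case when $\ell$ is convex (but $\ell''$ does not necessarily exist everywhere),
and the event $A_i$ is defined according to

$$A_i := \left\{ |\epsilon_i| \le \frac{T}{2} \right\} \cap \left\{ |x_i^T (\beta_1 - \beta_2)| \le \frac{T}{8r} \|\beta_1 - \beta_2\|_2 \right\} \cap \left\{ |x_i^T (\beta_2 - \beta^*)| \le \frac{T}{4} \right\}, \tag{80}$$

for a parameter $T > 0$, using the definition (18). Inequality (79) holds because when $\ell$ is
convex, each summand in inequality (78) is always bounded below by 0; and when $\ell''$ exists
and satisfies the bound (16), the mean value theorem gives

$$\left( \ell'(x_i^T \beta_1 - y_i) - \ell'(x_i^T \beta_2 - y_i) \right) x_i^T (\beta_1 - \beta_2) = \ell''(u_i) \left( x_i^T (\beta_1 - \beta_2) \right)^2 \ge -\kappa_2 \left( x_i^T (\beta_1 - \beta_2) \right)^2,$$

where $u_i$ is a point lying between $x_i^T \beta_1 - y_i$ and $x_i^T \beta_2 - y_i$.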

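Note that on $A_i$ and for $\|\beta_1 - \beta^*\|_2, \|\beta_2 - \beta^*\|_2 \le r$, the triangle inequality gives

$$|x_i^T \beta_2 - y_i| \le |x_i^T (\beta_2 - \beta^*)| + |\epsilon_i| \le T,$$

and

$$|x_i^T \beta_1 - y_i| \le |x_i^T (\beta_1 - \beta_2)| + |x_i^T (\beta_2 - \beta^*)| + |\epsilon_i| \le \frac{T}{4} + \frac{T}{4} + \frac{T}{2} = T.$$

Hence, the mean value theorem implies that

$$\ell'(x_i^T \beta_1 - y_i) - \ell'(x_i^T \beta_2 - y_i) = \ell''(u_i) \, x_i^T (\beta_1 - \beta_2),$$

for some $u_i$ with $|u_i| \le T$. We then deduce from inequality (78) that

$$\begin{aligned}
T(\beta_1, \beta_2) &\ge \alpha_T \cdot \frac{1}{n} \sum_{i=1}^n \left( x_i^T (\beta_1 - \beta_2) \right)^2 1_{A_i} - \kappa_2 \cdot \frac{1}{n} \sum_{i=1}^n \left( x_i^T (\beta_1 - \beta_2) \right)^2 1_{A_i^c} \\
&= (\alpha_T + \kappa_2) \cdot \frac{1}{n} \sum_{i=1}^n \left( x_i^T (\beta_1 - \beta_2) \right)^2 1_{A_i} - \kappa_2 \cdot \frac{1}{n} \sum_{i=1}^n \left( x_i^T (\beta_1 - \beta_2) \right)^2 \\
&\ge (\alpha_T + \kappa_2) \cdot \frac{1}{n} \sum_{i=1}^n \varphi_{T \|\beta_1 - \beta_2\|_2 / 8r} \left( x_i^T (\beta_1 - \beta_2) \right) \cdot \psi_{T/2}(\epsilon_i) \cdot \psi_{T/4} \left( x_i^T (\beta_2 - \beta^*) \right) - \kappa_2 \cdot \frac{1}{n} \sum_{i=1}^n \left( x_i^T (\beta_1 - \beta_2) \right)^2 \\
&=: (\alpha_T + \kappa_2) \cdot f(\beta_1, \beta_2) - \kappa_2 \cdot \tilde{f}(\beta_1, \beta_2).
\end{aligned} \tag{81}$$

Here, we have defined the truncation functions

$$\varphi_t(u) = \begin{cases} u^2, & \text{if } |u| \le \frac{t}{2}, \\ (t - |u|)^2, & \text{if } \frac{t}{2} \le |u| \le t, \\ 0, & \text{if } |u| \ge t, \end{cases} \qquad \text{and} \qquad \psi_t(u) = \begin{cases} 1, & \text{if } |u| \le \frac{t}{2}, \\ 2 - \frac{2}{t} |u|, & \text{if } \frac{t}{2} \le |u| \le t, \\ 0, & \text{if } |u| \ge t, \end{cases} \tag{82}$$

as well as the functions

$$\begin{aligned}
f(\beta_1, \beta_2) &:= \frac{1}{n} \sum_{i=1}^n \varphi_{T \|\beta_1 - \beta_2\|_2 / 8r} \left( x_i^T (\beta_1 - \beta_2) \right) \cdot \psi_{T/2}(\epsilon_i) \cdot \psi_{T/4} \left( x_i^T (\beta_2 - \beta^*) \right), \\
\tilde{f}(\beta_1, \beta_2) &:= \frac{1}{n} \sum_{i=1}^n \left( x_i^T (\beta_1 - \beta_2) \right)^2.
\end{aligned}$$

Note in particular that $\varphi_t$ and $\psi_t$ are $t$-Lipschitz and $\frac{2}{t}$-Lipschitz, respectively, and the truncation
functions satisfy the bounds

$$\varphi_t(u) \le u^2 \cdot 1\{|u| \le t\}, \qquad \text{and} \qquad \psi_t(u) \le 1\{|u| \le t\}.$$

Note also that inequality (81) also implies the simple bound

$$\langle \nabla \mathcal{L}_n(\beta_1) - \nabla \mathcal{L}_n(\beta_2), \, \beta_1 - \beta_2 \rangle \ge -\kappa_2 \cdot \tilde{f}(\beta_1, \beta_2). \tag{83}$$

We now define the sets

$$\mathbb{B}_\delta := \left\{ (\beta_1, \beta_2) : \|\beta_1 - \beta^*\|_2, \|\beta_2 - \beta^*\|_2 \le r, \; \frac{\|\beta_1 - \beta_2\|_1}{\|\beta_1 - \beta_2\|_2} \le \delta, \; \beta_1, \beta_2 \in \mathbb{B}_1(R) \right\},$$

for a parameter $1 \le \delta \le c \sqrt{\frac{n}{\log p}}$. Let

$$Z(\delta) := \sup_{(\beta_1, \beta_2) \in \mathbb{B}_\delta} \frac{1}{\|\beta_1 - \beta_2\|_2^2} \left| f(\beta_1, \beta_2) - E[f(\beta_1, \beta_2)] \right|,$$

and

$$\tilde{Z}(\delta) := \sup_{(\beta_1, \beta_2) \in \mathbb{B}_\delta} \frac{1}{\|\beta_1 - \beta_2\|_2^2} \left| \tilde{f}(\beta_1, \beta_2) - E[\tilde{f}(\beta_1, \beta_2)] \right|.$$

With this notation, inequality (81) implies that for all $(\beta_1, \beta_2) \in \mathbb{B}_\delta$, we have

$$\begin{aligned}
\frac{T(\beta_1, \beta_2)}{\|\beta_1 - \beta_2\|_2^2} &\ge (\alpha_T + \kappa_2) \cdot \frac{E[f(\beta_1, \beta_2)]}{\|\beta_1 - \beta_2\|_2^2} - \kappa_2 \cdot \frac{E[\tilde{f}(\beta_1, \beta_2)]}{\|\beta_1 - \beta_2\|_2^2} - (\alpha_T + \kappa_2) Z(\delta) - \kappa_2 \tilde{Z}(\delta) \\
&= \alpha_T \cdot \frac{E[\tilde{f}(\beta_1, \beta_2)]}{\|\beta_1 - \beta_2\|_2^2} - \frac{(\alpha_T + \kappa_2) \left( E[\tilde{f}(\beta_1, \beta_2)] - E[f(\beta_1, \beta_2)] \right)}{\|\beta_1 - \beta_2\|_2^2} - (\alpha_T + \kappa_2) Z(\delta) - \kappa_2 \tilde{Z}(\delta).
\end{aligned} \tag{84}$$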
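
The properties of the truncation functions $\varphi_t$ and $\psi_t$ from (82) that the argument relies on (the pointwise bounds, the Lipschitz constants, and the homogeneity $\frac{1}{c^2}\varphi_{ct}(cu) = \varphi_t(u)$ used later in Appendix C.2.3) are easy to confirm numerically. The following is a minimal sketch; the function names `phi` and `psi` are chosen here for illustration.

```python
import numpy as np


def phi(u, t):
    """Truncated quadratic: u^2 on |u| <= t/2, (t - |u|)^2 on t/2 <= |u| <= t, else 0."""
    a = np.abs(u)
    return np.where(a <= t / 2, u ** 2, np.where(a <= t, (t - a) ** 2, 0.0))


def psi(u, t):
    """Smoothed indicator: 1 on |u| <= t/2, 2 - (2/t)|u| on t/2 <= |u| <= t, else 0."""
    a = np.abs(u)
    return np.where(a <= t / 2, 1.0, np.where(a <= t, 2.0 - 2.0 * a / t, 0.0))


if __name__ == "__main__":
    u = np.linspace(-3.0, 3.0, 2001)
    t = 1.5
    # Largest finite-difference slopes; these should not exceed t and 2/t respectively.
    print(np.max(np.abs(np.diff(phi(u, t)) / np.diff(u))), t)
    print(np.max(np.abs(np.diff(psi(u, t)) / np.diff(u))), 2.0 / t)
```

Both functions are continuous at the breakpoints $|u| = t/2$ and $|u| = t$, which is what allows the Lipschitz property to be used in the Gaussian comparison argument.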


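The following lemma bounds the difference in expectations as a function of the truncation
parameters. The proof is provided in Appendix C.2.2.

Lemma 7. We have the bound

$$E[\tilde{f}(\beta_1, \beta_2)] - E[f(\beta_1, \beta_2)] \le c \sigma_x^2 \|\beta_1 - \beta_2\|_2^2 \left( \epsilon_T^{1/2} + \exp\left( -\frac{c' T^2}{\sigma_x^2 r^2} \right) \right).$$

In particular, Lemma 7 implies that when inequality (19) holds, we have

$$(\alpha_T + \kappa_2) \left( E[\tilde{f}(\beta_1, \beta_2)] - E[f(\beta_1, \beta_2)] \right) \le \frac{\alpha_T}{2} \cdot E[\tilde{f}(\beta_1, \beta_2)],$$

since

$$E[\tilde{f}(\beta_1, \beta_2)] \ge \lambda_{\min}(\Sigma_x) \cdot \|\beta_1 - \beta_2\|_2^2.$$

Then inequality (84) implies that

$$\frac{T(\beta_1, \beta_2)}{\|\beta_1 - \beta_2\|_2^2} \ge \frac{\alpha_T}{2} \cdot \frac{E[\tilde{f}(\beta_1, \beta_2)]}{\|\beta_1 - \beta_2\|_2^2} - (\alpha_T + \kappa_2) Z(\delta) - \kappa_2 \tilde{Z}(\delta). \tag{85}$$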

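We now focus on the terms $Z(\delta)$ and $\tilde{Z}(\delta)$. Note that $\tilde{f}(\beta_1, \beta_2)$ is a quadratic form in $\beta_1 - \beta_2$,
and for each unit vector $v \in \mathbb{R}^p$, the quantity $\frac{1}{n} \sum_{i=1}^n (x_i^T v)^2$ is an i.i.d. average of sub-exponential
variables with parameter proportional to $\sigma_x^2$. Then by Lemmas 11 and 12 in Loh
and Wainwright [34], we have the bound

$$\left| \tilde{f}(\beta_1, \beta_2) - E[\tilde{f}(\beta_1, \beta_2)] \right| \le t \sigma_x^2 \left( \|\beta_1 - \beta_2\|_2^2 + \frac{1}{k} \|\beta_1 - \beta_2\|_1^2 \right), \qquad \forall \beta_1, \beta_2 \in \mathbb{R}^p, \tag{86}$$

with probability at least $1 - c_1 \exp(-c_2 n t^2 + c_3 k \log p)$. In particular, since $\delta \le c \sqrt{\frac{n}{\log p}}$, we
may guarantee that

$$\kappa_2 \tilde{Z}(\delta) \le \frac{\alpha_T}{4} \cdot \frac{E[\tilde{f}(\beta_1, \beta_2)]}{\|\beta_1 - \beta_2\|_2^2}, \tag{87}$$

w.h.p. Turning to $Z(\delta)$, we have the following lemma, proved in Appendix C.2.3:

Lemma 8. For some constants $c$, $c'$, and $c''$, we have

$$P\left( Z(\delta) \ge c'' \sigma_x \left( \frac{RT}{r^2} + \frac{\delta T}{r} \right) \sqrt{\frac{\log p}{n}} \right) \le c \exp(-c' \log p). \tag{88}$$

Combining inequalities (87) and (88) with inequality (85), we then have

$$\frac{T(\beta_1, \beta_2)}{\|\beta_1 - \beta_2\|_2^2} \ge \frac{\alpha_T}{4} \cdot \frac{E[\tilde{f}(\beta_1, \beta_2)]}{\|\beta_1 - \beta_2\|_2^2} - (\alpha_T + \kappa_2) c'' \sigma_x \left( \frac{RT}{r^2} + \frac{\delta T}{r} \right) \sqrt{\frac{\log p}{n}}, \tag{89}$$

with probability at least $1 - c_1 \exp(-c_2 \log p)$. Let $n \gtrsim R^2 \log p$ be chosen such that

$$(\alpha_T + \kappa_2) c'' \sigma_x \cdot \frac{RT}{r^2} \sqrt{\frac{\log p}{n}} \le \frac{\alpha_T}{8} \cdot \frac{E[\tilde{f}(\beta_1, \beta_2)]}{\|\beta_1 - \beta_2\|_2^2}.$$

Then inequality (89) implies that

$$\frac{T(\beta_1, \beta_2)}{\|\beta_1 - \beta_2\|_2^2} \ge \frac{\alpha_T}{8} \cdot \frac{E[\tilde{f}(\beta_1, \beta_2)]}{\|\beta_1 - \beta_2\|_2^2} - c'' (\alpha_T + \kappa_2) \sigma_x \cdot \frac{\delta T}{r} \sqrt{\frac{\log p}{n}}. \tag{90}$$

We now extend inequality (90) to a bound that holds uniformly over the domain, with $\delta$
replaced by $\frac{\|\beta_1 - \beta_2\|_1}{\|\beta_1 - \beta_2\|_2}$. This is accomplished via a peeling argument in the proof of the following
lemma:

Lemma 9. Fix $c_0 > 0$, and let

$$\mathcal{D} := \left\{ \beta_1, \beta_2 \in \mathbb{B}_1(R) : \|\beta_1 - \beta^*\|_2, \|\beta_2 - \beta^*\|_2 \le r \;\text{ and }\; \frac{\|\beta_1 - \beta_2\|_1}{\|\beta_1 - \beta_2\|_2} \le \frac{c_0 \alpha_T \lambda_{\min}(\Sigma_x) \, r}{\sigma_x (\alpha_T + \kappa_2) T} \sqrt{\frac{n}{\log p}} \right\}.$$

With probability at least $1 - c_1' \exp(-c_2' \log p)$, the following inequality holds uniformly over all
$\beta_1, \beta_2 \in \mathcal{D}$: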

T(Pi,P 2 ) . A min (S x ) c"(a T + k 2 )ct x T \\Pi - p 2 \\i /log p 

-J2 > «T ' -«- 


-mi 


> O/.J' 


8 c 0 r IIPi — P 2 II 2 V n 

Amin/Si) d"(a T + k 2 ) 2 (7 2 x T 2 log p \\Pi - p 2 ||f 


16 


, 2„2 


cSr 


n \\Pi 


(91) 

(92) 


The proof of Lemma 9 is provided in Appendix C.2.4.
Finally, note that inequality (86) implies the bound

$$\tilde{f}(\beta_1, \beta_2) \le \alpha' \|\beta_1 - \beta_2\|_2^2 + \tau' \frac{\log p}{n} \|\beta_1 - \beta_2\|_1^2, \qquad \forall \beta_1, \beta_2 \in \mathbb{R}^p.$$

Together with inequality (83), we then see that for a proper choice of the constant cq, we have 


T(0i,0 2 ) > -« 2 (a'||/3i - ^ 2 1|1 - - 0 2 ||?) 


> QfT 1 


Amin (Ex) ||Q ^ 2 d"[aT + K 2 ) 2 (J 2 T 2 logp. 


16 




2 t .2 


Cnr 


n 


II? (93) 


whenever > c "^ aT c + K2)rTxT ^^- Combined with Lemma 9, inequality (93) implies 

the RSC condition (13). 


C.2.2 Proof of Lemma 7 

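Note that

$$\begin{aligned}
E[\tilde{f}(\beta_1, \beta_2)] - E[f(\beta_1, \beta_2)] &\le E\left[ \left( x_i^T (\beta_1 - \beta_2) \right)^2 1\left\{ |x_i^T (\beta_1 - \beta_2)| \ge \frac{T}{8r} \|\beta_1 - \beta_2\|_2 \right\} \right] \\
&\quad + E\left[ \left( x_i^T (\beta_1 - \beta_2) \right)^2 1\left\{ |\epsilon_i| \ge \frac{T}{2} \right\} \right] \\
&\quad + E\left[ \left( x_i^T (\beta_1 - \beta_2) \right)^2 1\left\{ |x_i^T (\beta_2 - \beta^*)| \ge \frac{T}{4} \right\} \right].
\end{aligned} \tag{94}$$

Applying the Cauchy-Schwarz inequality, we have bounds of the form

$$E\left[ \left( x_i^T (\beta_1 - \beta_2) \right)^2 1_{\mathcal{E}_i} \right] \le \left( E\left[ \left( x_i^T (\beta_1 - \beta_2) \right)^4 \right] \right)^{1/2} \cdot \left( E\left[ 1_{\mathcal{E}_i} \right] \right)^{1/2} \le c \sigma_x^2 \|\beta_1 - \beta_2\|_2^2 \cdot \left( P(\mathcal{E}_i) \right)^{1/2},$$

where $\mathcal{E}_i$ denotes any of the three events appearing in inequality (94), and the second inequality holds because of the assumption that $x_i$ is sub-Gaussian with
parameter $\sigma_x^2$.

Furthermore, note that

$$P\left( |x_i^T (\beta_2 - \beta^*)| \ge \frac{T}{4} \right) \le c \exp\left( -\frac{c' T^2}{\sigma_x^2 r^2} \right),$$

since $x_i$ is sub-Gaussian and $\|\beta_2 - \beta^*\|_2 \le r$ by assumption. Finally, we have

$$P\left( |x_i^T (\beta_1 - \beta_2)| \ge \frac{T}{8r} \|\beta_1 - \beta_2\|_2 \right) \le c \exp\left( -\frac{c' T^2}{\sigma_x^2 r^2} \right),$$

also by sub-Gaussianity of $x_i$. Combining these bounds with inequality (94) then implies the
desired result.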
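
The Cauchy-Schwarz step above has an exact empirical analogue, which the sketch below evaluates on simulated data. It is an illustration under assumed inputs (standard Gaussian draws standing in for $x_i^T(\beta_1 - \beta_2)$, and hypothetical function names), not part of the proof: for any sample, the empirical truncated second moment is bounded by the square root of the empirical fourth moment times the square root of the empirical tail frequency.

```python
import numpy as np


def truncated_second_moment(z, s):
    """Empirical version of E[z^2 * 1{|z| >= s}]."""
    return np.mean(z ** 2 * (np.abs(z) >= s))


def cauchy_schwarz_bound(z, s):
    """Empirical version of (E[z^4])^{1/2} * (P(|z| >= s))^{1/2}."""
    return np.sqrt(np.mean(z ** 4)) * np.sqrt(np.mean(np.abs(z) >= s))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.standard_normal(10_000)   # stand-in for x_i^T (beta_1 - beta_2)
    for s in (0.5, 1.0, 2.0, 3.0):
        print(s, truncated_second_moment(z, s), cauchy_schwarz_bound(z, s))
```

As the threshold `s` grows, the tail frequency decays quickly for sub-Gaussian draws, which is the source of the exponential term in Lemma 7.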


C.2.3 Proof of Lemma 8 


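We first bound $E[Z(\delta)]$. Following the argument in the proof of Lemma 11 of Loh and
Wainwright [35], we have

$$E[Z(\delta)] \le 2 \sqrt{\frac{\pi}{2}} \cdot E\left[ \sup_{(\beta_1, \beta_2) \in \mathbb{B}_\delta} \frac{1}{\|\Delta\|_2^2} \left| \frac{1}{n} \sum_{i=1}^n g_i \cdot \varphi_{T \|\Delta\|_2 / 8r} \left( x_i^T (\beta_1 - \beta_2) \right) \psi_{T/2}(\epsilon_i) \, \psi_{T/4} \left( x_i^T (\beta_2 - \beta^*) \right) \right| \right],$$

where we denote $\Delta := \beta_1 - \beta_2$, and the $g_i$'s are i.i.d. standard Gaussians. Define

$$Z_{\beta_1, \beta_2} := \frac{1}{\|\Delta\|_2^2} \cdot \frac{1}{n} \sum_{i=1}^n g_i \cdot \varphi_{T \|\Delta\|_2 / 8r} \left( x_i^T (\beta_1 - \beta_2) \right) \psi_{T/2}(\epsilon_i) \, \psi_{T/4} \left( x_i^T (\beta_2 - \beta^*) \right),$$

and note that conditioned on the $x_i$'s, each variable $Z_{\beta_1, \beta_2}$ is a Gaussian process. Furthermore,
for distinct pairs $(\beta_1, \beta_2)$ and $(\beta_1', \beta_2')$, we have

$$\mathrm{var}\left( Z_{\beta_1, \beta_2} - Z_{\beta_1', \beta_2'} \right) \le 2 \, \mathrm{var}\left( Z_{\beta_1, \beta_2} - Z_{\beta_2' + \Delta, \beta_2'} \right) + 2 \, \mathrm{var}\left( Z_{\beta_2' + \Delta, \beta_2'} - Z_{\beta_1', \beta_2'} \right).$$

Continuing to condition on the $x_i$'s, and denoting $\Delta' := \beta_1' - \beta_2'$, note that

$$\mathrm{var}\left( Z_{\beta_2' + \Delta, \beta_2'} - Z_{\beta_1', \beta_2'} \right) = \frac{1}{n^2} \sum_{i=1}^n \psi_{T/2}^2(\epsilon_i) \, \psi_{T/4}^2 \left( x_i^T (\beta_2' - \beta^*) \right) \left( \frac{1}{\|\Delta\|_2^2} \varphi_{T \|\Delta\|_2 / 8r} (x_i^T \Delta) - \frac{1}{\|\Delta'\|_2^2} \varphi_{T \|\Delta'\|_2 / 8r} (x_i^T \Delta') \right)^2. \tag{95}$$

Furthermore, $\varphi$ satisfies the homogeneity property that

$$\frac{1}{c^2} \cdot \varphi_{ct}(cu) = \varphi_t(u), \qquad \forall c > 0.$$

Hence, inequality (95) implies that

$$\begin{aligned}
\mathrm{var}\left( Z_{\beta_2' + \Delta, \beta_2'} - Z_{\beta_1', \beta_2'} \right) &\le \frac{1}{n^2} \sum_{i=1}^n \left( \varphi_{T/8r} \left( \frac{x_i^T \Delta}{\|\Delta\|_2} \right) - \varphi_{T/8r} \left( \frac{x_i^T \Delta'}{\|\Delta'\|_2} \right) \right)^2 \\
&\le \frac{1}{n^2} \sum_{i=1}^n \frac{T^2}{64 r^2} \left( \frac{x_i^T \Delta}{\|\Delta\|_2} - \frac{x_i^T \Delta'}{\|\Delta'\|_2} \right)^2,
\end{aligned}$$

where the second inequality uses the Lipschitz property of $\varphi$. Similarly, we may calculate

$$\mathrm{var}\left( Z_{\beta_1, \beta_2} - Z_{\beta_2' + \Delta, \beta_2'} \right) = \frac{1}{n^2} \sum_{i=1}^n \frac{1}{\|\Delta\|_2^4} \varphi_{T \|\Delta\|_2 / 8r}^2 (x_i^T \Delta) \, \psi_{T/2}^2(\epsilon_i) \left( \psi_{T/4} \left( x_i^T (\beta_2 - \beta^*) \right) - \psi_{T/4} \left( x_i^T (\beta_2' - \beta^*) \right) \right)^2. \tag{96}$$

Using the fact that $\psi_{T/4}$ is $\frac{8}{T}$-Lipschitz and $\varphi_{T \|\Delta\|_2 / 8r} \le \frac{T^2 \|\Delta\|_2^2}{256 r^2}$, inequality (96) implies that

$$\mathrm{var}\left( Z_{\beta_1, \beta_2} - Z_{\beta_2' + \Delta, \beta_2'} \right) \le \frac{1}{n^2} \sum_{i=1}^n \frac{T^4}{256^2 r^4} \cdot \frac{64}{T^2} \left( x_i^T (\beta_2 - \beta_2') \right)^2 = \frac{1}{n^2} \sum_{i=1}^n \frac{T^2}{1024 r^4} \left( x_i^T (\beta_2 - \beta_2') \right)^2.$$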

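If we define the second Gaussian process

$$Y_{\beta_1, \beta_2} := \frac{T}{16 r^2} \cdot \frac{1}{n} \sum_{i=1}^n g_i' \cdot x_i^T \beta_2 + \frac{T}{4r} \cdot \frac{1}{n} \sum_{i=1}^n g_i'' \cdot \frac{x_i^T (\beta_1 - \beta_2)}{\|\beta_1 - \beta_2\|_2},$$

where the $g_i'$'s and $g_i''$'s are independent standard Gaussians, the above calculation implies that

$$\mathrm{var}\left( Z_{\beta_1, \beta_2} - Z_{\beta_1', \beta_2'} \right) \le \mathrm{var}\left( Y_{\beta_1, \beta_2} - Y_{\beta_1', \beta_2'} \right).$$

Hence, Lemma 14 in Loh and Wainwright [35] implies that

$$E\left[ \sup_{(\beta_1, \beta_2) \in \mathbb{B}_\delta} Z_{\beta_1, \beta_2} \right] \le 2 \cdot E\left[ \sup_{(\beta_1, \beta_2) \in \mathbb{B}_\delta} Y_{\beta_1, \beta_2} \right],$$

where the expectations are no longer conditional. By an argument from Ledoux and Talagrand [31], we also have

$$E\left[ \sup_{(\beta_1, \beta_2) \in \mathbb{B}_\delta} |Z_{\beta_1, \beta_2}| \right] \le E\left[ |Z_{\beta_1', \beta_2'}| \right] + 2 \cdot E\left[ \sup_{(\beta_1, \beta_2) \in \mathbb{B}_\delta} Z_{\beta_1, \beta_2} \right],$$

for any fixed $(\beta_1', \beta_2') \in \mathbb{B}_\delta$. Furthermore,

$$E\left[ |Z_{\beta_1', \beta_2'}| \right] \le \frac{T^2}{256 r^2} \cdot \frac{1}{\sqrt{n}},$$

by conditioning on the $x_i$'s and using the bounds on $\varphi$ and $\psi$. We also have the bound

$$E\left[ \sup_{(\beta_1, \beta_2) \in \mathbb{B}_\delta} Y_{\beta_1, \beta_2} \right] \le \frac{TR}{16 r^2} \cdot E\left[ \left\| \frac{1}{n} \sum_{i=1}^n g_i' x_i \right\|_\infty \right] + \frac{\delta T}{4r} \cdot E\left[ \left\| \frac{1}{n} \sum_{i=1}^n g_i'' x_i \right\|_\infty \right] \le c \sigma_x \left( \frac{RT}{r^2} + \frac{\delta T}{r} \right) \sqrt{\frac{\log p}{n}}.$$

Hence,

$$E[Z(\delta)] \le c' \sigma_x \left( \frac{RT}{r^2} + \frac{\delta T}{r} \right) \sqrt{\frac{\log p}{n}}. \tag{97}$$

Further note that for $(\beta_1, \beta_2) \in \mathbb{B}_\delta$, each summand in $f(\beta_1, \beta_2)$ lies in the interval $\left[ 0, \frac{T^2 \|\beta_1 - \beta_2\|_2^2}{64 r^2} \right]$.
Hence, by the bounded differences inequality, we have

$$P\left( |Z(\delta) - E[Z(\delta)]| \ge t \right) \le c \exp\left( -\frac{c' r^4 n t^2}{T^4} \right). \tag{98}$$

Combining inequalities (97) and (98) then gives the desired result.

C.2.4 Proof of Lemma 9

We parallel the peeling argument constructed in the proof of Lemma 11 in Loh and Wainwright [35]. Define the event

$$\mathcal{E} := \left\{ \text{inequality (91) holds } \forall \beta_1, \beta_2 \in \mathcal{D} \right\},$$

and define the functions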


h(/31 , /?2; X) := cut • 


Amines) T(/3l,&) 


8 


- A 2 II 2 


2 5 


: = 


c"(«T + K 2 )(J X T / log p 


2 c 0 r 

7/0 ON. Il/3l — /?2 111 

'* (A ’ A) - PTftBl 


n 


By inequality (90), we have 

( 


sup h(/3i,p 2 ;X) > g(5) \ < ci exp(-c 2 logp), 


(/3i,/3 2 )eB: 


fnr nnv 1 ^ A ^ c 0 a T A min (S a; )r / n q: ||P1-P2||1 ^ i ^ 

tor any 1 < 0 < ^ (aT+K2 )r y logp- bmce H/L-ftlb - we have 


1 < h(Pi,p 2 ) < 


WPi-foWi 

||/3l —/?21|2 

Cq O 7 1 Amin ( P'x ) / 


^(OT + K 2 )T V logp’ 
over the region of interest. For each integer m> 1, define the set 

:= {(&,&) : 2 m “V < g{h{Pu P 2 )) < 2 m fi}nV, 

where fi := cc ° a r xT A union bound gives 

M 

P(£ c ) < J]P{3(A,^) € : htfufcX) > 2g(h(^, 

m= 1 


(99) 


where M := 



ra=l 


Then 


sup h(/3i,/32]X)>2 m g, 

(,Pl,Pz)€T>: 

^(/3i,/9 2 )<g- 1 (2-p) , 


< M ■ ci exp(—c 2 logp), 


using inequality (99). Hence, 


P(£ c ) < Mci exp ( —c 2 logp + log log 


n 


logp 


< c[ exp(-c' 2 logp). 


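Inequality (92) holds by applying the arithmetic mean-geometric mean inequality to inequality (91).

C.3 Proof of Proposition 3

Again defining $T(\beta_1, \beta_2)$ as in equation (78), we have

$$T(\beta_1, \beta_2) = \frac{1}{n} \sum_{i=1}^n w(x_i) \left( \ell'(x_i^T \beta_1 - y_i) - \ell'(x_i^T \beta_2 - y_i) \right) x_i^T (\beta_1 - \beta_2). \tag{100}$$

Defining the event $A_i$ as in equation (80), inequality (100) implies that

$$\begin{aligned}
T(\beta_1, \beta_2) &\ge \frac{1}{n} \sum_{i=1}^n w(x_i) \left( \ell'(x_i^T \beta_1 - y_i) - \ell'(x_i^T \beta_2 - y_i) \right) x_i^T (\beta_1 - \beta_2) \, 1_{A_i} - \kappa_2 \cdot \frac{1}{n} \sum_{i=1}^n w(x_i) \left( x_i^T (\beta_1 - \beta_2) \right)^2 1_{A_i^c} \\
&\ge \alpha_T \cdot \frac{1}{n} \sum_{i=1}^n w(x_i) \left( x_i^T (\beta_1 - \beta_2) \right)^2 1_{A_i} - \kappa_2 \cdot \frac{1}{n} \sum_{i=1}^n w(x_i) \left( x_i^T (\beta_1 - \beta_2) \right)^2 1_{A_i^c} \\
&= (\alpha_T + \kappa_2) \cdot \frac{1}{n} \sum_{i=1}^n w(x_i) \left( x_i^T (\beta_1 - \beta_2) \right)^2 1_{A_i} - \kappa_2 \cdot \frac{1}{n} \sum_{i=1}^n w(x_i) \left( x_i^T (\beta_1 - \beta_2) \right)^2.
\end{aligned}$$

Note that $w(x_i) x_i$ is a sub-Gaussian vector with parameter $c b_0^2$. Defining the truncation
functions $\varphi$ and $\psi$ as in equations (82), we then have

$$T(\beta_1, \beta_2) \ge (\alpha_T + \kappa_2) \cdot f(\beta_1, \beta_2) - \kappa_2 \cdot \tilde{f}(\beta_1, \beta_2),$$

as in inequality (81), where

$$\begin{aligned}
f(\beta_1, \beta_2) &:= \frac{1}{n} \sum_{i=1}^n w(x_i) \cdot \varphi_{T \|\beta_1 - \beta_2\|_2 / 8r} \left( x_i^T (\beta_1 - \beta_2) \right) \cdot \psi_{T/2}(\epsilon_i) \cdot \psi_{T/4} \left( x_i^T (\beta_2 - \beta^*) \right), \\
\tilde{f}(\beta_1, \beta_2) &:= \frac{1}{n} \sum_{i=1}^n w(x_i) \left( x_i^T (\beta_1 - \beta_2) \right)^2.
\end{aligned}$$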

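We first obtain an analog of Lemma 7, as follows:

Lemma 10. We have the bound

$$E[\tilde{f}(\beta_1, \beta_2)] - E[f(\beta_1, \beta_2)] \le c b_0 \sigma_x \|\beta_1 - \beta_2\|_2^2 \left( \epsilon_T^{1/2} + \exp\left( -\frac{c' T}{\sigma_x r} \right) \right).$$

Proof. We have

$$\begin{aligned}
E[\tilde{f}(\beta_1, \beta_2)] - E[f(\beta_1, \beta_2)] &\le E\left[ w(x_i) \left( x_i^T (\beta_1 - \beta_2) \right)^2 1\left\{ |x_i^T (\beta_1 - \beta_2)| \ge \frac{T}{8r} \|\beta_1 - \beta_2\|_2 \right\} \right] \\
&\quad + E\left[ w(x_i) \left( x_i^T (\beta_1 - \beta_2) \right)^2 1\left\{ |\epsilon_i| \ge \frac{T}{2} \right\} \right] \\
&\quad + E\left[ w(x_i) \left( x_i^T (\beta_1 - \beta_2) \right)^2 1\left\{ |x_i^T (\beta_2 - \beta^*)| \ge \frac{T}{4} \right\} \right].
\end{aligned} \tag{101}$$

Applying the Cauchy-Schwarz inequality to each term, the right-hand side of inequality (101)
is then upper-bounded by

$$\begin{aligned}
&\left( E\left[ w^2(x_i) \left( x_i^T (\beta_1 - \beta_2) \right)^4 \right] \right)^{1/2} \cdot \left\{ P\left( |x_i^T (\beta_1 - \beta_2)| \ge \frac{T}{8r} \|\beta_1 - \beta_2\|_2 \right)^{1/2} + P\left( |\epsilon_i| \ge \frac{T}{2} \right)^{1/2} + P\left( |x_i^T (\beta_2 - \beta^*)| \ge \frac{T}{4} \right)^{1/2} \right\} \\
&\qquad \le c b_0 \sigma_x \|\beta_1 - \beta_2\|_2^2 \left( \epsilon_T^{1/2} + \exp\left( -\frac{c' T}{\sigma_x r} \right) \right),
\end{aligned}$$

using the assumption that $x_i$ is sub-exponential. □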


Note that the statement of Lemma 8 holds without modification, because the additional
factor of $w(x_i)$ vanishes in the Gaussian comparison argument in the proof of the lemma,
since $w(x_i) \le 1$. Furthermore, $\tilde{f}(\beta_1, \beta_2)$ is again a quadratic form in $\beta_1 - \beta_2$, and since
$w(x_i) x_i$ is bounded and $x_i$ is sub-exponential, the quantity $\frac{1}{n} \sum_{i=1}^n w(x_i) (x_i^T v)^2$ is an i.i.d.
average of sub-exponential terms with parameter proportional to $b_0 \sigma_x$. Hence, a version of
inequality (86) holds, with $\sigma_x^2$ replaced by $b_0 \sigma_x$. Then Lemma 9 follows by an identical peeling
argument. Putting together the pieces, we arrive at the desired result.


C.4 Proof of Proposition 4 

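This is very similar to the proof of Proposition 2. Again using the notation for the Taylor
remainder defined in equation (78), we have

$$T(\beta_1, \beta_2) = \frac{1}{n} \sum_{i=1}^n w(x_i) x_i^T (\beta_1 - \beta_2) \left\{ \ell'\left( (x_i^T \beta_1 - y_i) w(x_i) \right) - \ell'\left( (x_i^T \beta_2 - y_i) w(x_i) \right) \right\}.$$

Defining

$$A_i := \left\{ |\epsilon_i| \le \frac{T}{2} \right\} \cap \left\{ |w(x_i) x_i^T (\beta_1 - \beta_2)| \le \frac{T}{8r} \|\beta_1 - \beta_2\|_2 \right\} \cap \left\{ |w(x_i) x_i^T (\beta_2 - \beta^*)| \le \frac{T}{4} \right\},$$

we have that on the event $A_i$ and for $\|\beta_1 - \beta^*\|_2, \|\beta_2 - \beta^*\|_2 \le r$,

$$|w(x_i) (x_i^T \beta_2 - y_i)| \le |w(x_i) x_i^T (\beta_2 - \beta^*)| + |w(x_i) \epsilon_i| \le |w(x_i) x_i^T (\beta_2 - \beta^*)| + |\epsilon_i| \le T,$$

and

$$|w(x_i) (x_i^T \beta_1 - y_i)| \le |w(x_i) x_i^T (\beta_1 - \beta_2)| + |w(x_i) x_i^T (\beta_2 - \beta^*)| + |w(x_i) \epsilon_i| \le \frac{T}{4} + \frac{T}{4} + \frac{T}{2} = T,$$

using the fact that $w(x_i) \le 1$. Hence, we have

$$\begin{aligned}
T(\beta_1, \beta_2) &\ge \alpha_T \cdot \frac{1}{n} \sum_{i=1}^n \left( w(x_i) x_i^T (\beta_1 - \beta_2) \right)^2 1_{A_i} - \kappa_2 \cdot \frac{1}{n} \sum_{i=1}^n \left( w(x_i) x_i^T (\beta_1 - \beta_2) \right)^2 1_{A_i^c} \\
&= (\alpha_T + \kappa_2) \cdot \frac{1}{n} \sum_{i=1}^n \left( w(x_i) x_i^T (\beta_1 - \beta_2) \right)^2 1_{A_i} - \kappa_2 \cdot \frac{1}{n} \sum_{i=1}^n \left( w(x_i) x_i^T (\beta_1 - \beta_2) \right)^2.
\end{aligned}$$

We may define the truncation functions exactly as in the proof of Proposition 2, the only
modification being that $x_i$ is replaced by $w(x_i) x_i$. Furthermore, since $|w(x_i) x_i^T v| \le b_0$ for
every unit vector $v$, the vector $w(x_i) x_i$ is always sub-Gaussian with parameter $b_0^2$, regardless
of the distribution of $x_i$. It follows that with

$$\begin{aligned}
f(\beta_1, \beta_2) &:= \frac{1}{n} \sum_{i=1}^n \varphi_{T \|\beta_1 - \beta_2\|_2 / 8r} \left( w(x_i) x_i^T (\beta_1 - \beta_2) \right) \cdot \psi_{T/2}(\epsilon_i) \cdot \psi_{T/4} \left( w(x_i) x_i^T (\beta_2 - \beta^*) \right), \\
\tilde{f}(\beta_1, \beta_2) &:= \frac{1}{n} \sum_{i=1}^n \left( w(x_i) x_i^T (\beta_1 - \beta_2) \right)^2,
\end{aligned}$$

we arrive at the familiar inequality,

$$T(\beta_1, \beta_2) \ge (\alpha_T + \kappa_2) \cdot f(\beta_1, \beta_2) - \kappa_2 \cdot \tilde{f}(\beta_1, \beta_2).$$

The remainder of the proof is identical to the proof of Proposition 2, with $x_i$ replaced by
$w(x_i) x_i$, which is sub-Gaussian with parameter $b_0^2$.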
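
The boundedness $|w(x_i) x_i^T v| \le b_0$ that drives this proof can be illustrated with one concrete weight function. The choice below, $w(x) = \min\{1, b_0 / \|x\|_2\}$, is an assumption made here for illustration (the proof only uses $w \le 1$ and the bound itself), and `bounded_weight` is a hypothetical name. The check uses heavy-tailed Cauchy covariates to emphasize that the bound holds regardless of the distribution of $x_i$.

```python
import numpy as np


def bounded_weight(x, b0):
    """Weight w(x) = min(1, b0 / ||x||_2), so that ||w(x) x||_2 <= b0 and w(x) <= 1."""
    norm = np.linalg.norm(x)
    return 1.0 if norm <= b0 else b0 / norm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b0 = 2.0
    worst = 0.0
    for _ in range(1000):
        x = rng.standard_cauchy(10)           # heavy-tailed covariate vector
        v = rng.standard_normal(10)
        v /= np.linalg.norm(v)                # random unit vector
        worst = max(worst, abs(bounded_weight(x, b0) * (x @ v)))
    print("largest |w(x) x^T v| observed:", worst, "<= b0 =", b0)
```

By the Cauchy-Schwarz inequality, $|w(x) x^T v| \le w(x) \|x\|_2 \le b_0$ for every unit vector $v$, which is exactly what the loop confirms.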


D Proof of Corollary 1 


The proof of this corollary is a fairly immediate consequence of Theorem 2 and the following 
result from He and Shao [25]: 

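Lemma 11 (Corollary 2.1, He and Shao [25]). Suppose we have i.i.d. observations from the
usual linear regression model

$$y_i = x_i^T \beta^* + \epsilon_i,$$

where $\beta^* \in \mathbb{R}^p$. Suppose

$$\mathcal{L}_n(\beta) = \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i),$$

and the following conditions are satisfied:

(i) In probability, $0 < \lambda_{\min}\left( \frac{X^T X}{n} \right)$ and $\lambda_{\max}\left( \frac{X^T X}{n} \right) < \infty$.

(ii) $\ell$ is convex and smooth, $\ell''$ and $\ell'''$ are bounded, and $E[\ell''(\epsilon_i)] \in (0, \infty)$.

(iii) $\max_{1 \le i \le n} \frac{\|x_i\|_2^2}{p} = O_p(1)$ and $\sup_{\|u\|_2 = \|w\|_2 = 1} \frac{1}{n} \sum_{i=1}^n |x_i^T u|^2 |x_i^T w|^2 = O_p(1)$.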

Suppose C n has a unique minimizer given by j3. If pl °^ p —> 0 , then \\/3 — P *\\2 = O p 
If p *° gp —> 0 , then for any unit vector v € M p , we have 

c v 



where 


: = 


E[^(ei)]-E (£'(ei)y 


■ v 


X T X 


n 


v.

We apply the result to the oracle estimator defined in equation (21), with $k$ taking the
place of $p$. Although Lemma 11 requires $\mathcal{L}_n$ to be convex, a careful inspection of the proofs
in He and Shao [25] reveals that the results still hold if we restrict our attention to a subset
of $\mathbb{R}^p$ on which $\mathcal{L}_n$ is convex and $\widehat{\beta}$ is the unique minimizer. By Lemma 1, this is exactly the
case over the restricted region $S_r$ when $\mathcal{L}_n$ satisfies the RSC condition (13). Furthermore, it
is straightforward to check that conditions (i)-(iii) of Lemma 11 hold under the given assumptions.
Note that by Theorem 2.1 in Hsu et al. [27], we have

$$P\left( \frac{\|x_i\|_2^2}{k} \ge t \right) \le c_1 \exp(-c_2 k), \qquad \forall i,$$

when the $x_i$'s are sub-Gaussian, implying that

$$P\left( \max_{1 \le i \le n} \frac{\|x_i\|_2^2}{k} \ge t \right) \le n \cdot c_1 \exp(-c_2 k).$$

Hence, for $k \ge C \log n$, the right-hand expression is bounded above by $c_1 \exp(-c_2' k)$, and the
first part of condition (iii) is satisfied.

We conclude that the desired results hold for the oracle estimator, and by Theorem 2,
also for the stationary point of interest.

E Proofs of additional lemmas 

In this section, we provide proofs of additional technical lemmas appearing in the body of the 
paper. 

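E.1 Proof of Lemma 1

For $\beta_1, \beta_2 \in S_r$, we have

$$\|\beta_1 - \beta_2\|_1 \le \sqrt{k} \, \|\beta_1 - \beta_2\|_2.$$

Hence, the RSC condition (13) implies that

$$\langle \nabla \mathcal{L}_n(\beta_1) - \nabla \mathcal{L}_n(\beta_2), \, \beta_1 - \beta_2 \rangle \ge \left( \alpha - \tau \frac{k \log p}{n} \right) \|\beta_1 - \beta_2\|_2^2,$$

implying the desired conclusion.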
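
For intuition about when the curvature coefficient in this bound is positive, the following sketch evaluates $\alpha - \tau k \log p / n$ and the smallest sample size making it positive. The function names and the numerical values in the usage example are hypothetical and chosen only for illustration.

```python
import math


def restricted_curvature(alpha, tau, k, p, n):
    """Coefficient alpha - tau * k * log(p) / n multiplying ||beta_1 - beta_2||_2^2."""
    return alpha - tau * k * math.log(p) / n


def min_sample_size(alpha, tau, k, p):
    """Smallest integer n for which restricted_curvature(alpha, tau, k, p, n) > 0."""
    return math.floor(tau * k * math.log(p) / alpha) + 1


if __name__ == "__main__":
    alpha, tau, k, p = 1.0, 2.0, 10, 1000     # assumed values, for illustration
    n0 = min_sample_size(alpha, tau, k, p)
    print("threshold sample size:", n0)
    print("curvature at threshold:", restricted_curvature(alpha, tau, k, p, n0))
```

The threshold scales as $k \log p$, matching the sample size requirement under which the loss is strongly convex over the restricted region.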

E.2 Proof of Lemma 2 

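Note that if $X = (X_1, \dots, X_n)$ is a vector of i.i.d. $\alpha$-stable random variables with $\gamma = 1$,
equation (27) implies that for $w \in \mathbb{R}^n$, we have

$$E\left[ \exp\left( it \cdot w^T X \right) \right] = \exp\left( -\|w\|_\alpha^\alpha |t|^\alpha \right), \qquad \forall t > 0,$$

where $\|w\|_\alpha := \left( \sum_{i=1}^n |w_i|^\alpha \right)^{1/\alpha}$. Hence, $w^T X$ is also $\alpha$-stable, but with the scale parameter
$\|w\|_\alpha$. Furthermore, if $Z \in \mathbb{R}$ is sub-Gaussian with parameter $\sigma_z^2$, then for $\alpha \in (0, 2]$, the
random variable $|Z|^\alpha$ is sub-exponential with parameter $c \sigma_z^2$. Indeed, the moments of $|Z|^\alpha$
may be bounded as

$$\left( E\left[ |Z|^{\alpha p} \right] \right)^{1/p} \le \left( E\left[ |Z|^{2p} \right] \right)^{\frac{\alpha}{2p}} \le \left( c \sigma_z \sqrt{p} \right)^\alpha \le c' \sigma_z^2 p,$$

where the first inequality comes from Hölder's inequality and the second inequality follows
because $Z$ is sub-Gaussian [60]. Hence, $|Z|^\alpha$ is sub-exponential. Consequently, for any $1 \le j \le p$,
the quantity

$$\left( \frac{\|X_j\|_\alpha}{n^{1/\alpha}} \right)^\alpha = \frac{1}{n} \sum_{i=1}^n |X_{ij}|^\alpha$$

exhibits sub-exponential concentration to $E[|X_{ij}|^\alpha]$.

In the context of ordinary least squares regression with the Lasso, note that for an arbitrary
$1 \le j \le p$, we have

$$P\left( \left\| \frac{X^T \epsilon}{n} \right\|_\infty \ge \lambda \right) = P\left( \left\| \frac{X^T \epsilon}{n^{1/\alpha}} \right\|_\infty \ge n^{1 - 1/\alpha} \lambda \right) \ge P\left( \left| \frac{e_j^T X^T \epsilon}{n^{1/\alpha}} \right| \ge n^{1 - 1/\alpha} \lambda \right). \tag{102}$$

Since $\frac{e_j^T X^T \epsilon}{n^{1/\alpha}}$ is $\alpha$-stable with scale parameter $\Theta\left( E[|X_{ij}|^\alpha] \right)$, by the above discussion, the right-hand
expression in inequality (102) is bounded below by a constant $c_\alpha$ whenever $n^{1 - 1/\alpha} \lambda \to 0$.
In particular, this is the case when $\alpha < 2$. Hence, we conclude that the bound

$$\left\| \frac{X^T \epsilon}{n} \right\|_\infty \lesssim \sqrt{\frac{\log p}{n}}$$

does not hold w.h.p. when the entries of $\epsilon$ are drawn from an $\alpha$-stable distribution with $\alpha < 2$.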
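
The stability property used at the start of this proof can be checked by simulation in the special case $\alpha = 1$, where the standard Cauchy distribution is the symmetric stable law with unit scale. The sketch below relies on that special case as an assumption (general $\alpha$ would need a stable sampler, which is not attempted here), and the function names are hypothetical. For a Cauchy variable with scale $s$, the median of its absolute value equals $s$, so the empirical median of $|w^T X|$ should be close to $\|w\|_1$.

```python
import numpy as np


def l_alpha_norm(w, alpha):
    """Compute ||w||_alpha = (sum_i |w_i|^alpha)^(1/alpha)."""
    return (np.abs(w) ** alpha).sum() ** (1.0 / alpha)


def empirical_cauchy_scale(w, n_samples, seed=0):
    """Estimate the scale of w^T X for i.i.d. standard Cauchy X via the median of |w^T X|."""
    rng = np.random.default_rng(seed)
    X = rng.standard_cauchy((n_samples, len(w)))
    return np.median(np.abs(X @ w))


if __name__ == "__main__":
    w = np.array([0.5, -1.5, 2.0])
    print("||w||_1:", l_alpha_norm(w, 1.0))
    print("estimated scale:", empirical_cauchy_scale(w, 200_000))
```

With $w = (1/n, \dots, 1/n)$ the scale is $1$ for every $n$, which is the $\alpha = 1$ instance of the failure of averaging to concentrate that the proof exploits.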


Finally, recall that if f3 is a global solution for the Lasso and A ^ 
■^ 2 -error bound 


X T t 


, we have the 


— /3* II 2 < cVk ■ max / A, - X 

l n ooJ 


with high probability [8]. This establishes the inconsistency of the Lasso estimator.